Enron was a natural-gas-transmission company founded in 1985 in the US. In the 1990s, the US Congress adopted a series of laws to deregulate the sale of natural gas. This caused Enron to lose its exclusivity rights on the natural gas pipeline. During this time, Jeffrey Skilling, who was initially a consultant and later became the company’s chief operating officer, transformed Enron into a trader of energy derivatives, acting as an intermediary between natural-gas producers and their customers. Soon after, Enron became a leader in this market and made huge profits from its trades. This golden age allowed the company to recruit Andrew Fastow, who quickly became the chief financial officer. Moreover, the company diversified its activities to include electricity, coal, paper, and steel. However, success has its limits, and in the late 1990s, the company’s profits began to shrink. Under pressure from shareholders, company executives started relying on dubious accounting practices, particularly “mark-to-market accounting,” which allowed the company to record unrealized future gains from some trading contracts as current income, thus giving the illusion of higher current profits. In August 2001, some people at the head of the company began to worry about a possible accounting scandal due to this practice. In October 2001, the Securities and Exchange Commission began investigating Enron’s transactions. This was the starting event that led the company to bankruptcy, which officially began in December 2001.
Source: Britannica, “Enron scandal”.
The principal aim of this project is to explore Enron’s email data set to extract insights about the fiscal fraud investigation and the company’s bankruptcy in 2001. For that we have three data sets:
the employee list with their email addresses
the emails exchanged from 1999 to 2002
the recipients of each email (to, cc, bcc).
In this study we investigate the email exchanges from the side of both the sender and the recipient. This is done at three levels:
without a priori, meaning all senders and recipients
by employee status
for people known to be implicated in the fraud, as well as the people found to be the most active in the email exchanges.
At each level we look at the number of emails sent and received over the study period and analyze the subject and text of those emails by focusing on keywords attached to a few topics (meetings, business, and Enron events).
The resulting insights are available in a Shiny app.
For this project we used several libraries. For data exploration, analysis, and visualization: tidyverse, circlize, wordcloud, ggpubr, patchwork, gridExtra, grid, gtable, and ggbreak.
To display the results in the R Markdown report: knitr.
To create the Shiny app: shiny.
#library
library(tidyverse)
library(circlize)
library(wordcloud)
library(ggpubr)
library(patchwork)
library(gridExtra)
library(grid)
library(gtable)
library(ggbreak)
library(knitr)
library(shiny)
#dataset
load(file = "C:/Users/marie/Documents/DSTI_Cours/R_big_Data/Exam/Enron_project/Enron.Rdata")
We design a function to extract a legend that is common to several plots in a layout so that it is displayed only once. We won’t use it when the legend changes between plots, to avoid confusion.
#function to extract the legend shared by several plots
get_legend <- function(p, #a plot from the layout whose legend is shared with the other plots
nrow=2 #the number of rows used to display the legend, 2 by default
){
#override the guides to control the number of rows in the legend
p_wrapped <- p + guides(
#control how the legend is arranged
fill = guide_legend(nrow = nrow, byrow = TRUE),
color = guide_legend(nrow = nrow, byrow = TRUE))
#build the table of graphical objects (grobs) for the plot
temp <- ggplotGrob(p_wrapped)
#keep the grobs named "guide-box", which hold the legend
legend <- temp$grobs[which(sapply(temp$grobs, function(x) x$name) == "guide-box")]
#return the first legend rather than the whole list
return(legend[[1]])
}
The aim of this part is to see:
what kind of data the different tables contain
whether there are missing values and how to handle them
Description of the data set variables and dimension:
dim_employee <- dim(employeelist)
summary(employeelist)
## eid firstName lastName Email_id
## Min. : 1.00 Length:149 Length:149 Length:149
## 1st Qu.: 38.00 Class :character Class :character Class :character
## Median : 75.00 Mode :character Mode :character Mode :character
## Mean : 75.07
## 3rd Qu.:112.00
## Max. :150.00
##
## Email2 Email3 EMail4 folder
## Length:149 Length:149 Length:149 Length:149
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## status
## Employee :41
## N/A :31
## Vice President:23
## Director :14
## Manager :14
## (Other) :25
## NA's : 1
This data set contains 149 rows and 9 columns.
This data set contains the employee ID (eid), the first and last name of each employee, their status, their email addresses, and the folder where their emails are stored. In the status variable, missing values appear in two forms: those recognized by R (NA) and those entered directly in the data by its owner as the string N/A. The eid variable is numeric, status is a factor, and the other variables are character.
Display of some observations in the data frame:
kable(employeelist[1:10, ])
| eid | firstName | lastName | Email_id | Email2 | Email3 | EMail4 | folder | status |
|---|---|---|---|---|---|---|---|---|
| 13 | Marie | Heard | marie.heard@enron.com | | | | heard-m | NA |
| 6 | Mark | Taylor | mark.e.taylor@enron.com | mark.taylor@enron.com | e.taylor@enron.com | | taylor-m | Employee |
| 19 | Lindy | Donoho | lindy.donoho@enron.com | ldonoho@enron.com | | | donoho-l | Employee |
| 115 | Lisa | Gang | lisa.gang@enron.com | | | | gang-l | N/A |
| 129 | Jeffrey | Skilling | jeff.skilling@enron.com | jeffrey.skilling@enron.com | | | skilling-j | CEO |
| 18 | Lynn | Blair | lynn.blair@enron.com | lynnblair@enron.com | | | blair-l | Director |
| 33 | Kim | Ward | kim.ward@enron.com | kward@enron.com | | | ward-k | N/A |
| 149 | Kate | Symes | kate.symes@enron.com | ksymes@enron.com | | | symes-k | Employee |
| 52 | Kay | Mann | kay.mann@enron.com | | | | mann-k | Employee |
| 21 | Keith | Holst | keith.holst@enron.com | kholst@enron.com | | | holst-k | Director |
By looking at the head of the data, we observe that eid is stored as numeric, but factor seems the more suitable type because it is an employee identifier. In addition, the variables Email2, Email3, and EMail4 contain many blanks.
To investigate the blanks, we temporarily change the data type of those variables from character to factor to see how summary() reports the blank observations.
kable(employeelist %>% transform(
Email2 = as.factor(Email2),
Email3 = as.factor(Email3),
EMail4 = as.factor(EMail4)
) %>% summary())
| eid | firstName | lastName | Email_id | Email2 | Email3 | EMail4 | folder | status | |
|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Length:149 | Length:149 | Length:149 | :52 | :100 | :147 | Length:149 | Employee :41 | |
| 1st Qu.: 38.00 | Class :character | Class :character | Class :character | a..shankman@enron.com : 1 | a..martin@enron.com : 1 | j..kean@enron.com : 1 | Class :character | N/A :31 | |
| Median : 75.00 | Mode :character | Mode :character | Mode :character | andrew.lewis@enron.com: 1 | andrew.h.lewis@enron.com: 1 | peter.f.keavey@enron.com: 1 | Mode :character | Vice President:23 | |
| Mean : 75.07 | NA | NA | NA | azipper@enron.com : 1 | c.germany@enron.com : 1 | NA | NA | Director :14 | |
| 3rd Qu.:112.00 | NA | NA | NA | b..sanders@enron.com : 1 | carol.stclair@enron.com : 1 | NA | NA | Manager :14 | |
| Max. :150.00 | NA | NA | NA | barbo@enron.com : 1 | dana_davis@enron.com : 1 | NA | NA | (Other) :25 | |
| NA | NA | NA | NA | (Other) :92 | (Other) : 44 | NA | NA | NA’s : 1 |
We can see that the Email2, Email3, and EMail4 variables have no missing values as such; their empty entries are blank strings (52, 100, and 147 of the 149 rows, respectively). In Email3 and EMail4 more than half of the values are blank, so those variables may not be very helpful for the analysis. In the status variable, missing values are declared in two ways: 31 values are the string N/A and only 1 is a real NA. For that variable we will need to replace N/A with real NA values to homogenize the data.
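The homogenization described above can be sketched on a toy table. The data frame `toy_employees` and its values are hypothetical; only the column names come from the employee list. Base R is used here so the sketch runs without extra packages.

```r
# Toy stand-in for employeelist (hypothetical values)
toy_employees <- data.frame(
  Email_id = c("a@enron.com", "b@enron.com", "c@enron.com"),
  Email2   = c("", "b2@enron.com", ""),
  status   = factor(c("Employee", "N/A", NA)),
  stringsAsFactors = FALSE
)

toy_clean <- toy_employees
# Blank strings become real missing values
toy_clean$Email2[toy_clean$Email2 == ""] <- NA
# The "N/A" label becomes a real NA; which() skips rows that are already NA
toy_clean$status[which(toy_clean$status == "N/A")] <- NA
# Remove the now unused "N/A" level from the factor
toy_clean$status <- droplevels(toy_clean$status)

# Number of missing values per column after cleaning
colSums(is.na(toy_clean))
```

Dropping the unused level is optional, but it keeps later summaries and plot legends from listing a category that no longer has any rows.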
Description of the data set variables and dimension:
dim_message <- dim(message)
kable(summary(message))
| mid | sender | date | message_id | subject | |
|---|---|---|---|---|---|
| Min. : 52 | jeff.dasovich@enron.com : 6273 | Min. :0001-05-30 | 10000282.1075847198841.JavaMail.evans@thyme: 1 | Length:252759 | |
| 1st Qu.: 88565 | j.kaminski@enron.com : 5838 | 1st Qu.:2000-12-01 | 10000478.1075841161605.JavaMail.evans@thyme: 1 | Class :character | |
| Median :186421 | kay.mann@enron.com : 5100 | Median :2001-05-21 | 1000097.1075860055721.JavaMail.evans@thyme : 1 | Mode :character | |
| Mean :190260 | sara.shackleton@enron.com: 4797 | Mean :1999-04-15 | 1000099.1075858574579.JavaMail.evans@thyme : 1 | NA | |
| 3rd Qu.:279962 | tana.jones@enron.com : 4437 | 3rd Qu.:2001-10-25 | 1000115.1075852075775.JavaMail.evans@thyme : 1 | NA | |
| Max. :404927 | chris.germany@enron.com : 3686 | Max. :2044-01-04 | 1000122.1075858816233.JavaMail.evans@thyme : 1 | NA | |
| NA | (Other) :222628 | NA | (Other) :252753 | NA |
This data set contains 252759 rows and 5 columns.
Here we observe that summary() describes the mid and date variables with numeric-style statistics, the variables sender and message_id are factors, and the variable subject is character.
Display of some observations in the data frame:
kable(message[1:10, ])
By looking at the head of the data, we observe that mid doesn’t behave like numeric data but rather like an identifier, as the eid variable does in the employeelist table. In the data frame the date variable looks like a real date. Moreover, the values in the subject variable seem to be repeated several times, suggesting that it behaves more like a categorical variable than like a set of individual strings.
Because the summary seems to treat the date variable as numeric while the observations displayed above look like real dates, we use the class() function to check whether R stores it with the Date class:
class(message$date) == "Date"
## [1] TRUE
The result confirms that R stores the date variable with the right type, Date. There is no need to change the data type of this variable.
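As a side note on this kind of check, the sketch below compares `class(x) == "Date"` with `inherits()`. The two toy objects are hypothetical. Some objects carry more than one class, in which case the `==` comparison returns one value per class, so `inherits()` tends to be the safer test.

```r
toy_date     <- as.Date("2001-10-25")
toy_datetime <- as.POSIXct("2001-10-25 12:00:00", tz = "UTC")

# A Date object has a single class, so both checks give one TRUE
class(toy_date) == "Date"
inherits(toy_date, "Date")

# A POSIXct object has two classes ("POSIXct" and "POSIXt"),
# so the == comparison returns a vector of length 2
class(toy_datetime) == "Date"
inherits(toy_datetime, "Date")
```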
The minimum and maximum values returned for the date variable are strange dates. In the introduction we saw that the data cover the period from 1999 to 2002, and those values fall outside that period.
To understand what those values are, we filter the table to keep the rows whose year is before 1999 or after 2002:
kable(message %>%
select(date) %>% #keep the date variable
mutate(year = format(date,"%Y")) %>% #extract the year from the date
filter((year < 1999) | (year > 2002)) %>% #keep the value below and after the study's period
group_by(year) %>% count()) #count the number of rows per date out of the study's period
| year | n |
|---|---|
| 0001 | 205 |
| 0002 | 53 |
| 1979 | 6 |
| 1997 | 1 |
| 1998 | 85 |
| 2004 | 53 |
| 2007 | 1 |
| 2020 | 2 |
| 2043 | 1 |
| 2044 | 3 |
Filtering the strange dates shows that some are not plausible dates (years 0001 and 0002) and the others fall outside the study period. Together they represent 410 values, about 0.16% of the observations in the table.
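The same filter can be sketched with the year converted to an integer, which avoids relying on string comparison between the text returned by format() and the numeric bounds. The vector `toy_dates` is hypothetical; the last line reproduces the share computed from the counts in the table above.

```r
# Hypothetical dates, three of them outside the 1999-2002 study period
toy_dates <- as.Date(c("1979-12-31", "1998-12-31", "1999-01-01",
                       "2001-10-25", "2002-12-31", "2044-01-04"))

# Extract the year as an integer so the comparison is numeric
toy_year <- as.integer(format(toy_dates, "%Y"))
out_of_range <- toy_year < 1999 | toy_year > 2002

# Count of out-of-period rows per year
table(toy_year[out_of_range])

# Share of out-of-period rows in the message table, from the counts above
out_counts <- c(205, 53, 6, 1, 85, 53, 1, 2, 1, 3)
share_percent <- round(100 * sum(out_counts) / 252759, 2)
```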
The variables mid and message_id could be redundant. To verify this, we count the number of rows for each pair of message_id and mid and keep the pairs that appear more than once, to see whether the two identifiers always go together.
kable(message%>% select(mid, message_id) %>% #select only the variable we need.
transform(mid = as.factor(mid)) %>% #transform the mid into factor data type.
group_by(message_id) %>%
count(mid) %>% #count the number of mid per message_id group, create a n variable with the result.
filter(n != 1)) #filter to get the rows with a value different than 1.
| message_id | mid | n |
|---|---|---|
This shows that each message_id is attached to one and only one mid, which confirms that the two variables are redundant in the data frame. To lighten the data, we can keep only one of them for the analysis.
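The grouped count above verifies that no (message_id, mid) pair is repeated. A stricter check of the one-to-one relationship can be sketched as follows; the helper `is_one_to_one()` and both toy tables are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: TRUE when every value of column a is paired with
# exactly one value of column b, and the reverse
is_one_to_one <- function(df, a, b) {
  pairs <- unique(df[, c(a, b)])
  # after removing repeated pairs, neither column may contain duplicates
  !anyDuplicated(pairs[[a]]) && !anyDuplicated(pairs[[b]])
}

toy_ok  <- data.frame(mid = c(52, 53, 54),
                      message_id = c("id-a", "id-b", "id-c"))
# Here "id-a" is attached to two different mid values
toy_bad <- data.frame(mid = c(52, 53, 54),
                      message_id = c("id-a", "id-a", "id-c"))

is_one_to_one(toy_ok, "mid", "message_id")
is_one_to_one(toy_bad, "mid", "message_id")
```

In `toy_bad` every pair still appears only once, so a count per pair would not flag it, whereas the duplicate test on each column does.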
As we saw in the table summary, the sender variable holds the email address of each email’s sender. Those addresses also appear in the employeelist table, which gives the company status of most employees, but there they are split across four variables. In addition, the variables Email3 and EMail4 contain many blank values. To see how the two tables can be merged, we look at how the email addresses correspond between them.
#prepared table to only check which email address in the Email_ID are also in the sender
employee_merge1 <- employeelist %>% mutate(sender = Email_id) %>% select(sender)
employee_merge2 <- employeelist %>% mutate(sender = Email2) %>% select(sender)
employee_merge3 <- employeelist %>% mutate(sender = Email3) %>% select(sender)
employee_merge4 <- employeelist %>% mutate(sender = EMail4) %>% select(sender)
#to do the join only with the sender variable
message_merge <- message %>% select(sender)
#first between the sender in the message table and the Email_id in the employeelist
EmailID_sender1 <- inner_join(message_merge, employee_merge1, by = "sender")
EmailID_sender1 %>% count()
## n
## 1 104766
#between the sender in the message table and the Email2 in the employeelist
EmailID_sender2 <- inner_join(message_merge, employee_merge2, by = "sender")
EmailID_sender2 %>% count()
## n
## 1 0
#between the sender in the message table and the Email3 in the employeelist
EmailID_sender3 <- inner_join(message_merge, employee_merge3, by = "sender")
EmailID_sender3 %>% count()
## n
## 1 1170
#between the sender in the message table and the EMail4 in the employeelist
EmailID_sender4 <- inner_join(message_merge, employee_merge4, by = "sender")
EmailID_sender4 %>% count()
## n
## 1 0
The inner joins show that, in the employeelist table, only the variables Email_id and Email3 hold email addresses that also appear in the sender variable of the message table. To attach an employee status to the sender’s email address, we need to merge on those two variables.
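The four joins above can be condensed into one loop over the address columns. This is a minimal sketch on hypothetical toy data, not the code used for the report. It counts the sender rows whose address appears in each column, which equals the inner-join row count as long as an address is not repeated within a column.

```r
# Toy stand-ins (hypothetical values); column names follow employeelist
toy_employees <- data.frame(
  Email_id = c("kay.mann@enron.com", "mark.e.taylor@enron.com"),
  Email2   = c("", "mark.taylor@enron.com"),
  Email3   = c("", "e.taylor@enron.com"),
  EMail4   = c("", ""),
  stringsAsFactors = FALSE
)
toy_sender <- c("kay.mann@enron.com", "kay.mann@enron.com",
                "e.taylor@enron.com", "someone@else.com")

email_cols <- c("Email_id", "Email2", "Email3", "EMail4")

# For each address column, count the sender rows that match one of its addresses
matches <- sapply(email_cols, function(col) {
  addresses <- toy_employees[[col]]
  addresses <- addresses[addresses != ""]  # ignore blank entries
  sum(toy_sender %in% addresses)
})
matches
```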
Description of the data set variables and dimension:
dim_recipient <- dim(recipientinfo)
summary(recipientinfo)
## rid mid rtype
## Min. : 67 Min. : 52 BCC: 253713
## 1st Qu.: 718289 1st Qu.:105438 CC : 253735
## Median :1515296 Median :198263 TO :1556994
## Mean :1543862 Mean :196168
## 3rd Qu.:2309682 3rd Qu.:280673
## Max. :3242063 Max. :404927
##
## rvalue
## no.address@enron.com : 19198
## jeff.dasovich@enron.com : 11137
## richard.shapiro@enron.com: 11015
## steven.j.kean@enron.com : 10873
## james.d.steffes@enron.com: 10615
## tana.jones@enron.com : 9781
## (Other) :1991823
This data set contains 2064442 rows and 4 columns. The summary reveals that R treats rid and mid as numeric variables and rtype and rvalue as factors.
Display of some observations in the data frame:
| rid | mid | rtype | rvalue |
|---|---|---|---|
| 67 | 52 | TO | all.worldwide@enron.com |
| 68 | 53 | TO | all.downtown@enron.com |
| 69 | 54 | TO | all.enron-worldwide@enron.com |
| 70 | 55 | TO | all.worldwide@enron.com |
| 71 | 56 | TO | all_enron_north.america@enron.com |
| 72 | 56 | TO | ec.communications@enron.com |
| 73 | 57 | TO | charlotte@wptf.org |
| 74 | 58 | TO | sap.mailout@enron.com |
| 75 | 59 | TO | robert.badeer@enron.com |
| 76 | 60 | TO | tim.belden@enron.com |
By looking at the head of this data set we can see that rid and mid are identifiers; given the result returned by the summary function, we need to convert those variables to factors to give them the right type. Also, the mid variable is a foreign key that links this table with the message table. Joining these two tables gives us the sender and the receiver of each email as well as the type of receiver (direct with TO, or “indirect” with CC and BCC). The last variable, rvalue, is the email address of the receiver, which can be general (e.g., all.worldwide@enron.com, seen in the head of the table) or specific to a person (e.g., jeff.dasovich@enron.com, the top specific receiver in the summary of that table). The specific email addresses in the rvalue variable can be matched with the email addresses in the employeelist table to get each employee’s status in the company. We proceed as with the message table.
#prepare tables to check which email addresses in the employeelist are also in rvalue
employee_merge1 <- employeelist %>% mutate(rvalue = Email_id) %>% select(rvalue)
employee_merge2 <- employeelist %>% mutate(rvalue = Email2) %>% select(rvalue)
employee_merge3 <- employeelist %>% mutate(rvalue = Email3) %>% select(rvalue)
employee_merge4 <- employeelist %>% mutate(rvalue = EMail4) %>% select(rvalue)
#to do the join only with the rvalue variable
recipient_merge <- recipientinfo %>% select(rvalue)
#first between the rvalue in the recipient table and the Email_id in the employeelist
EmailID_recipient1 <- inner_join(recipient_merge, employee_merge1, by = "rvalue")
EmailID_recipient1 %>% count()
## n
## 1 361234
# between the rvalue in the recipient table and the Email2 in the employeelist
EmailID_recipient2 <- inner_join(recipient_merge, employee_merge2, by = "rvalue")
EmailID_recipient2 %>% count()
## n
## 1 0
#between the rvalue in the recipient table and the Email3 in the employeelist
EmailID_recipient3 <- inner_join(recipient_merge, employee_merge3, by = "rvalue")
EmailID_recipient3 %>% count()
## n
## 1 2382
#between the rvalue in the recipient table and the EMail4 in the employeelist
EmailID_recipient4 <- inner_join(recipient_merge, employee_merge4, by = "rvalue")
EmailID_recipient4 %>% count()
## n
## 1 0
As with the message table, the only matches are between rvalue and the Email_id and Email3 variables.
Description of the data set variables and dimension:
dim_reference <- dim(referenceinfo)
summary(referenceinfo)
## rfid mid reference
## Min. : 2 Min. : 79 Length:54778
## 1st Qu.:14305 1st Qu.: 60580 Class :character
## Median :30987 Median :178176 Mode :character
## Mean :30860 Mean :179738
## 3rd Qu.:46728 3rd Qu.:275557
## Max. :63024 Max. :404920
This data set contains 54778 rows and 3 columns.
The summary shows that the variables rfid and mid are numeric and the reference variable is character.
Display of some observations in the data frame:
kable(referenceinfo[5:10, ])
| | rfid | mid | reference |
|---|---|---|---|
| 5 | 14 | 845 | From: Monaco, John [EM] [mailto:john.monaco@citi.com]Sent: Thursday, March 07, 2002 6:40 AMTo: Badeer, RobertSubject: FW: RE: Whats up!!!!!Still around!!!!—–Original Message—–From: enron.mailsweeper.admin@enron.com[mailto:enron.mailsweeper.admin@enron.com] Sent: Thursday, March 07, 2002 9:36 AMTo: Monaco, John [EM]Subject: RE:RE: Whats up!!!!!The enron.com recipient(s)rbadeer@exchange.enron.comhave moved to a new organization. The new email address follows the formatfirstname.lastname@ubswenergy.com orfirstname.initial.lastname@ubswenergy.com (as per their original enron.comemail address). Email sent to recipient(s) at enron.com will not bedelivered. |
| 6 | 15 | 846 | From: Rangel, Ina Sent: Thursday, March 07, 2002 8:11 AMTo: Badeer, RobertSubject: Expense ReceiptsBob:I received your expense receipts today. Will submit them today.Ina Rangel |
| 7 | 16 | 847 | From: Grigsby, Mike Sent: Friday, March 08, 2002 9:08 AMTo: Badeer, RobertSubject: RE: BADGEGo with Ina —–Original Message—–From: Badeer, Robert Sent: Friday, March 08, 2002 11:08 AMTo: Grigsby, MikeSubject: RE: BADGEGrigs, Ina said it would be on the 5th floor of the new building. Which is right? —–Original Message—–From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256 |
| 8 | 17 | 848 | From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256 |
| 9 | 18 | 849 | From: Rangel, Ina Sent: Thursday, March 07, 2002 12:56 PMTo: Badeer, RobertSubject: FW: Badge AccessWhen you get here on Monday morning, come to the 5th floor reception of the new building. If your badge is not there, then I will come and pick you up when you get here and bring you up. Your badge will be ready Monday for sure, whether it be morning or afternoon I am not sure of.-Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:50 PMTo: Rangel, InaSubject: RE: Badge AccessIna,We can most likely have this by Monday morning and he can pick this up at the 5th floor reception. If he has any problems he can call me. Thanks!Mandy —–Original Message—–From: Rangel, Ina Sent: Thursday, March 07, 2002 2:39 PMTo: Curless, AmandaSubject: RE: Badge Access << File: Badge Access Form.doc >> I filled out all of the information that I had on him. Will he be able to have his badge by Monday morning and where will he go to pick it up.Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:00 PMTo: Rangel, InaSubject: Badge Access << File: Badge Access Form.doc >> Ina,Pleae fill out and return to me at ECS 05848. You can e-mail this to me if this is easier. Thanks!Mandy |
| 10 | 19 | 851 | From: Hyatt, Kevin Sent: Wednesday, July 25, 2001 1:00 PMTo: Nielsen, JeffSubject: RE: Mid 4 to Mid 3 QuoteJeff, can you fill in the rates for the 5,7, and 10 year terms for me. These would be notional of course. Let me know if you have questions.thxKevin 713-853-5559 Term/yrs. 2 5 7 10 Demand: Firm* $.02 - .03 $.04-.05 $.06-.07 $.07-.08 TI $.035 - .045 $.065-$.075 $.075-.085 $.095-.105 Volume is min. 0 to max of 200,000/d * plus minimum commodity Primary to El Paso Waha would be slightly higher Rates are plus fuel -----Original Message-----From: Nielsen, Jeff Sent: Monday, July 23, 2001 4:39 PMTo: Hyatt, KevinSubject: Mid 4 to Mid 3 QuoteKevin,Jo Williams said that you needed a quote for transportation from Mid 4 to Mid 3 in the Waha area. On a firm basis we would be would in the $.02 to $.03 demand range plus minimum commodity. For a TI rate use between $.035 and $.045. If you would like primary to El Paso Waha, that rate would be a little higher. We have been able to get additional value out of that interconnect because of the gas prices in California. Please let me know if you need any additional information.Jeff 402-398-7434 |
By looking at the head of that table we can see that:
the rfid and mid variables aren’t numeric but look like identifiers. Their data type needs to be changed to factor, which suits them better.
the reference variable in the referenceinfo table holds the content of each message. The table also has the mid variable, which allows us to merge it with the message and/or recipientinfo tables.
the message and recipientinfo tables contain email addresses, as the employeelist table does. These tables could therefore be merged through the addresses.
By exploring these data sets we identified some issues that need to be handled before the analysis: data type changes, missing values, variable redundancy, and data set merging.
We choose to:
Change the data type of the identifier variables in the different tables from numeric to factor.
Change the data type of the subject variable from character to factor.
Withdraw the message_id variable from the message table to lighten the data set. In addition, we drop the rows whose date is invalid or outside the study period (1999 to 2002).
Withdraw the Email2 and EMail4 variables from the employeelist table because they don’t match any email address in the message and recipientinfo tables.
Keep the referenceinfo table even though it isn’t exhaustive: it contains only 54,778 observations, about 2.7% of the number of rows in the recipientinfo table. We will be able to analyse only a small part of the email exchanges.
Create a table that gathers all the information about the messages by merging the message, referenceinfo, and recipientinfo tables through the mid foreign key.
Keep the NA values in the status of the sender and the receiver. This lets us keep all the information about the exchanges; dropping them could lose information.
employeelist_2 <- employeelist %>%
select(-c(Email2, EMail4)) %>% #the variable we don't need in the data
transform(eid = as.factor(eid)) %>% #data type change for the variable eid to factor
mutate(status = if_else((status == "N/A"), NA, status)) #homogenized the declaration of the NA in the variable status
Description of the new table employee list:
summary(employeelist_2)
## eid firstName lastName Email_id
## 1 : 1 Length:149 Length:149 Length:149
## 2 : 1 Class :character Class :character Class :character
## 3 : 1 Mode :character Mode :character Mode :character
## 4 : 1
## 5 : 1
## 6 : 1
## (Other):143
## Email3 folder status
## Length:149 Length:149 Employee :41
## Class :character Class :character Vice President:23
## Mode :character Mode :character Director :14
## Manager :14
## Trader :13
## (Other) :12
## NA's :32
Verification of the data type of the table variables:
#return the data type for every variable in the table
str(employeelist_2)
## 'data.frame': 149 obs. of 7 variables:
## $ eid : Factor w/ 149 levels "1","2","3","4",..: 13 6 19 115 129 18 33 148 52 21 ...
## $ firstName: chr "Marie" "Mark" "Lindy" "Lisa" ...
## $ lastName : chr "Heard" "Taylor" "Donoho" "Gang" ...
## $ Email_id : chr "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
## $ Email3 : chr "" "e.taylor@enron.com" "" "" ...
## $ folder : chr "heard-m" "taylor-m" "donoho-l" "gang-l" ...
## $ status : Factor w/ 10 levels "CEO","Director",..: NA 3 3 NA 1 2 NA 3 3 2 ...
The results of the summary and str functions show that the data type change, the homogenization of the NA values, and the removal of the variables were done correctly. We can now use this table to pursue the analysis.
message_2 <- message %>%
select(-c(message_id)) %>% #withdraw the variable we don't need
transform(#change the data type for factor
mid = as.factor(mid),
sender = as.factor(sender),
subject = as.factor(subject)) %>%
#add the year variable in the table from the date
mutate(year = as.factor(format(date, "%Y"))) %>%
#filter to keep only the dates from 1999 to 2002
filter(year %in% c(1999 : 2002)) %>%
#drop the year variable, which is no longer useful in the data
select(-year)
recipientinfo_2 <- recipientinfo %>%
#change the variable data type for factor
transform(rid = as.factor(rid),
rvalue = as.factor(rvalue),
mid = as.factor(mid))
referenceinfo_2 <- referenceinfo %>%
#change the variable data type for factor
transform(rfid = as.factor(rfid),
mid = as.factor(mid))
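The merge through the mid key, which produces the df_message table used in the next steps, could look like the sketch below. The toy tables and the choice of join types (inner join with the recipients, left join with the references) are assumptions made for illustration, not the code used for this report. Base R merge() is used so the sketch runs on its own.

```r
# Toy stand-ins for message_2, recipientinfo_2 and referenceinfo_2 (hypothetical values)
toy_message <- data.frame(
  mid    = c(1, 2),
  sender = c("a@enron.com", "b@enron.com")
)
toy_recipient <- data.frame(
  rid    = c(10, 11, 12),
  mid    = c(1, 1, 2),
  rtype  = c("TO", "CC", "TO"),
  rvalue = c("b@enron.com", "c@enron.com", "a@enron.com")
)
toy_reference <- data.frame(
  rfid      = 100,
  mid       = 1,
  reference = "From: A To: B Subject: test"
)

# One row per (message, recipient) pair
toy_df_message <- merge(toy_message, toy_recipient, by = "mid")
# Keep every row and add the email content where it exists (NA otherwise)
toy_df_message <- merge(toy_df_message, toy_reference, by = "mid", all.x = TRUE)
toy_df_message
```

With a left join on the references, messages without stored content stay in the table with an NA in the reference column, which matches the later filtering on `!is.na(reference)`.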
First, we do it for the sender with Email_id:
#prepared the employeelist table for the merge
employee_merge_final <- employeelist_2 %>%
select(Email_id, status) %>% #keep only the variables we need
mutate(status_sender = status) %>% #rename the status variable to know to who is attached the status
select(-status)
#merged with the df_message table
df_message_status <- left_join(df_message, employee_merge_final,
join_by(sender == Email_id))
#verification the merged work
df_message_status %>% filter(!is.na(status_sender)) %>% count()
## n
## 1 294291
Then we do it for the sender with Email3:
#prepared the employeelist table for the merge
employee_merge_final2 <- employeelist_2 %>%
select(Email3, status) %>% #keep only the variables we need
mutate(status_sender_email3 = status) %>% #rename the status variable to know to who is attached the status
select(-status)
#merged with the df_message table
df_message_status <- left_join(df_message_status, employee_merge_final2,
join_by(sender == Email3))
#verification the merged work
df_message_status %>% filter(!is.na(status_sender_email3)) %>% count()
## n
## 1 2034
Finally, we group all the sender statuses into one variable:
df_message_status <- df_message_status %>% mutate(
#replace the NA values in status_sender with the status found through Email3
status_sender = if_else(is.na(status_sender), status_sender_email3, status_sender)) %>%
select(-status_sender_email3) #drop the helper variable
#verification the merged work
df_message_status %>% filter(!is.na(status_sender)) %>% count()
## n
## 1 296325
With this operation we attached an employee status to the sender in 296,325 rows. Next we do the same for the recipient.
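The step that fills the missing statuses from the second lookup can also be written with `coalesce()` from dplyr, which returns the first non-missing value at each position. The toy table below is hypothetical; this is an alternative sketch, not the code used above.

```r
library(dplyr)

# Toy table holding the two status columns produced by the two joins (hypothetical values)
toy_status <- data.frame(
  sender               = c("a@enron.com", "b@enron.com", "c@enron.com"),
  status_sender        = c("Employee", NA, NA),
  status_sender_email3 = c(NA, "Director", NA),
  stringsAsFactors = FALSE
)

toy_status <- toy_status %>%
  # take status_sender when present, otherwise the status found through Email3
  mutate(status_sender = coalesce(status_sender, status_sender_email3)) %>%
  select(-status_sender_email3)

toy_status
```

Rows with no match in either lookup keep an NA status, as in the report.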
First, we do it for the recipient with Email_id:
#prepared the employeelist table for the merge
employee_merge_final_recipient <- employeelist_2 %>%
select(Email_id, status) %>% #keep only the variables we need
mutate(status_recipient = status) %>% #rename the status variable to know to who is attached the status
select(-status)
#merged with the df_message table
df_message_status <- left_join(df_message_status, employee_merge_final_recipient,
join_by(rvalue == Email_id))
#verification the merged work
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
## n
## 1 291737
Then we do it for the recipient with Email3:
#prepared the employeelist table for the merge
employee_merge_final_recipient2 <- employeelist_2 %>%
select(Email3, status) %>% #keep only the variables we need
mutate(status_recipient_email3 = status) %>% #rename the status variable to know to who is attached the status
select(-status)
#merged with the df_message table
df_message_status <- left_join(df_message_status, employee_merge_final_recipient2,
join_by(rvalue == Email3))
#verification the merged work
df_message_status %>% filter(!is.na(status_recipient_email3)) %>% count()
## n
## 1 2382
Finally, we group all the recipient statuses into one variable:
df_message_status <- df_message_status %>% mutate(
#replace the NA values in status_recipient with the status found through Email3
status_recipient = if_else(is.na(status_recipient), status_recipient_email3, status_recipient)) %>%
select(-status_recipient_email3) #drop the helper variable
#verification the merged work
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
## n
## 1 294119
By doing this we identify the status of the recipient in 294,119 rows.
Now that all the information we need is grouped in the same data frame, we look at the period covered by the email content in the reference variable.
start <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
arrange(date) %>% head(n=1)
end <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
arrange(desc(date)) %>% head(n=1)
length_email_content <- df_message %>% filter(!is.na(reference)) %>% count()
We have 268524 rows with email content; the first message dates from 1999-05-07 and the last from 2002-07-12. We can therefore analyse the content of part of the messages exchanged between Enron employees over this period.
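The start date, end date, and row count can also be obtained in one pass with range(). The toy data frame below is hypothetical and only mirrors the date and reference columns.

```r
# Toy stand-in for df_message (hypothetical values)
toy_df <- data.frame(
  date      = as.Date(c("1999-05-07", "2001-01-15", "2002-07-12", "2000-03-01")),
  reference = c("text", NA, "text", "text"),
  stringsAsFactors = FALSE
)

# Keep only the rows that have email content
with_content <- toy_df[!is.na(toy_df$reference), ]

# First and last date covered by the email content, and number of rows
period <- range(with_content$date)
n_with_content <- nrow(with_content)

period
n_with_content
```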
To facilitate the analysis and lighten the data frame, we withdraw the identifier columns that are no longer useful and rename the rvalue variable to recipient, which is more meaningful.
df_message_status <- df_message_status %>%
#withdraw the identifier variables
select(-c(mid, rfid, rid)) %>%
#create recipient from rvalue and remove any spaces the email addresses could contain
mutate(recipient = gsub(" ", "", rvalue),
sender = gsub(" ", "", sender)) %>%
#order the variables
select(date, sender, status_sender, rtype, recipient, status_recipient, subject, reference)
#cleaning of the object no more necessary in the environment
rm(employeelist, message, message_2, recipientinfo, recipientinfo_2, referenceinfo, referenceinfo_2, df_message_missing, message_merge, recipient_merge, EmailID_sender1, EmailID_sender2, EmailID_sender3, EmailID_sender4, EmailID_recipient1, EmailID_recipient2, EmailID_recipient3, EmailID_recipient4, employee_merge1, employee_merge2, employee_merge3, employee_merge4, end, start, length_email_content, employee_merge_final, employee_merge_final2, employee_merge_final_recipient, employee_merge_final_recipient2, dim_employee, dim_message, dim_recipient, dim_reference)
#in this part we will draw many plots; they will all use the same theme
theme_set(theme_light())
We start by drawing a global picture of the cleaned data.
Emailcount <- count(df_message_status %>% filter(rtype == "TO") %>% distinct(sender, recipient, subject, reference))
Reply <- count(df_message_status %>% filter(str_detect(subject, "^RE:")) %>% distinct(sender, recipient, subject, reference))
emailExchangeStatus <- count(df_message_status %>% distinct(sender, status_sender, recipient, status_recipient, subject, reference) %>% filter(!is.na(status_sender)|!is.na(status_recipient)))
In this data set, we have 17501 senders and 67571 recipients. The large difference between the number of senders and recipients suggests that a single email is often addressed to several people. We have 908151 distinct direct email exchanges (recipient type “TO”), of which 9.82 % are replies to earlier emails (subject starting with “RE:”). This suggests that most emails convey information one way, and that few are real back-and-forth exchanges between workers. Perhaps at that time workers communicated through other means, such as the telephone. Moreover, among all the email exchanges, we know the status of the sender or the recipient at Enron for only 31.44 %, which suggests that many emails come from external sources and/or from workers with unidentified statuses. It is also possible that some emails are addressed to mailing lists that group several employees of the company. For those, we can’t determine the status of the workers.
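The share of replies depends on how a reply is detected. The pattern `"^RE:"` is case-sensitive, so subjects starting with “Re:” or “re:” are not counted. The base R sketch below uses invented subjects to compare the strict pattern with a case-insensitive variant; whether the Enron subjects actually contain such variants is an assumption that would need to be checked on the data.

```r
# Toy subjects (invented for illustration)
subjects <- c("RE: meeting", "Re: gas prices", "FW: report", "Weekly update", "re: lunch")

# Case-sensitive match, as in the analysis above
strict <- grepl("^RE:", subjects)

# Case-insensitive variant that also accepts "Re:" and "re:"
relaxed <- grepl("^re:", subjects, ignore.case = TRUE)

# Share of replies under each definition, in percent
share_strict <- round(100 * mean(strict), 1)
share_relaxed <- round(100 * mean(relaxed), 1)
c(strict = share_strict, relaxed = share_relaxed)
```

With stringr, the equivalent of the relaxed test would be `str_detect(subject, regex("^re:", ignore_case = TRUE))`.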
enronEmailAdd <- count(df_message_status %>% filter((str_detect(sender,"@enron")) | (str_detect(recipient,"@enron"))) %>% distinct(sender, recipient))
Estimation_generalEmailAdd <- count(df_message_status %>%
#key word regularly used for general email address name and see in the sender or recipient variable
filter(str_detect(sender, "^enron|^press|^office|^all|^announcement|^communications|affair|client|contact|secur|team|comit|^west|energy") | str_detect(recipient, "^enron|^press|^office|^all|^announcement|^communications|affair|client|contact|secur|team|comit|^west|energy")))
Exchange_ext_enron <- count(
#extract the variable we need
df_message_status %>% select(date, sender, recipient, subject, reference) %>%
#count for each the sender and recipient whose have an enron email address
mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>%
#for each date and subject for each date make the sum of the sender and recipient with an enron email address
group_by(date, subject) %>% mutate(
sum_sender = sum(count_sender),
sum_recipient = sum(count_recipient)) %>% ungroup() %>%
#isolate the email exchange which not involved person with an enron email address
filter((sum_sender ==0) & (sum_recipient == 0)))
In our data set, 255866 distinct sender/recipient pairs involve at least one Enron email address. Enron had many clients with their own email domains, and these exchanges may also include spam. This could be the reason why only about 30% of the email addresses in those exchanges have an Enron email domain. We can also estimate that about 63879 records are emails sent from or addressed to a general email address that covers several workers at Enron or at one of its clients. Finally, 25212 records are emails both sent by and addressed to people without an Enron domain in their email addresses. These exchanges represent about 1% of the total emails in the data set.
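The test `str_detect(x, "@enron")` is a substring match: it is true for any address that contains “@enron”, whatever follows. As a hedged illustration, the base R sketch below compares this loose test with a stricter test on the extracted domain. The addresses are invented, and treating `enron.com` as the only internal domain is an assumption made for the example.

```r
# Invented addresses, for illustration only
emails <- c("john.doe@enron.com", "sales@enron-supplies.example",
            "jane@client.example", "desk@enron.com")

# Substring test used above: TRUE for any address containing "@enron"
loose <- grepl("@enron", emails, fixed = TRUE)

# Stricter test on the extracted domain (assumes "enron.com" is the domain of interest)
domain <- sub("^.*@", "", emails)
strict <- domain == "enron.com"

data.frame(emails, loose, strict)
```

Comparing the two columns on the real addresses would show how many records the loose test classifies as internal only because their domain starts with “enron”.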
#count the number of email address without enron domain for the sender
c1 <- df_message_status %>% distinct(sender) %>% mutate(
count_tot_sender = n(),
count_ext_sender = if_else((!str_detect(sender, "@enron")), 1, 0),
#count_ext_recipient = if_else((!str_detect(recipient, "@enron")), 1, 0),
sum_ext_sender = sum(count_ext_sender),
pct_ext_sender = paste0(round((sum_ext_sender/count_tot_sender)*100), "%")
#sum_ext_recipient = sum(count_ext_recipient)
) %>% distinct(sum_ext_sender, pct_ext_sender)
#count the number of email address without enron domain for the recipient
c2 <- df_message_status %>% distinct(recipient) %>% mutate(
count_tot_recipient = n(),
count_ext_recipient = if_else((!str_detect(recipient, "@enron")), 1, 0),
sum_ext_recipient = sum(count_ext_recipient),
pct_ext_recipient = paste0(round((sum_ext_recipient/count_tot_recipient)*100), "%")
) %>% distinct(sum_ext_recipient, pct_ext_recipient)
#bind the both count in the same dataframe
cbind(c1, c2)
## sum_ext_sender pct_ext_sender sum_ext_recipient pct_ext_recipient
## 1 11457 65% 39313 58%
This highlights that more than half of the senders and recipients do not have an email address with an Enron domain. This suggests that the email exchanges may be more between Enron employees and the company’s clients. It is also possible that the emails are sent to or from personal email addresses of Enron employees, perhaps in the case of informal exchanges.
From this initial overview of the data, we can deduce that:
The data set is not exhaustive regarding either the status of employees in the company or the content of the emails.
Many exchanges involve external addresses. Still, most exchanges seem to involve at least one Enron address, since fewer than 10% of the emails are both sent from and addressed to addresses without an Enron domain.
It seems that few emails are real exchanges between employees, as few emails contain “RE:” in their subject.
A small part of the email exchanges seems to be between people who are external to the Enron company. Although they represent a negligible part of the total data set, we will keep them for further analysis.
Given this, we decide to include the employees without status to avoid losing any information about the email exchanges and to keep the external email addresses for the analysis.
To explore the number of employees per status, we use the employeelist_2 data frame, which contains the email address, the name, and the status of each Enron worker.
Number of employees per status:
employeelist_2 %>% select(status) %>% #select the needed variable
group_by(status) %>% count() %>% #count the number of employee per status
ungroup() %>%
#calculate the percentage for each status
mutate(perc = `n`/sum(`n`),
labels = scales::percent(perc)) %>%
#bar chart
ggplot(aes(reorder(status, perc ,sum),perc, fill = status)) +
geom_bar(stat = "identity") +
#to invert the axis's position
coord_flip()+
#customize the theme, title and axis labels
geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
scale_y_continuous(labels = scales::percent_format())+
ggtitle("Percentage of employees per status in the employee list")+
labs(y = "Percentage (%)",x = "Employee status") +
scale_fill_brewer(palette = "Set3",
#to display the NA in grey on the graph
na.value = "grey50"
)+
theme(legend.position = "none")
The above bar chart shows us that:
The two largest groups are the ‘employee’ and ‘unknown’ statuses (27.48% and 21.48% respectively).
There are few lawyers (less than 1% of the total number of employees).
Surprisingly, a lot of employees have a ‘vice president’ status (about 15%).
There is a similar number of managers, directors, and traders in the company (about 9% each).
At the head of the company, there are several CEOs, Presidents, and Managing Directors (about 2% each).
Next, we look at the email exchanges over the study period. First, we extract the month and the year from the date and store them in separate variables.
df_message_status <- df_message_status %>%
mutate(year = format(date,"%Y"), #extract the year from the date
month = format(date, "%m")) %>% #extract the month from the date
transform( #to put the variables in the right type
year = as.factor(year),
month = as.factor(month))
df_message_status %>% group_by(year,month)%>%
count() %>%
ggplot(aes(month, n, group = year, color = year))+
geom_line(size = 1)+
scale_y_continuous(labels = scales::label_comma())+
labs(title = "Number of emails sent/received per month by Enron's workers",
x = "Month",
y = "Number of emails")+
#the lines are mapped to the color aesthetic, so the palette is set on the color scale
scale_color_brewer(palette = "Set3")
The above plot shows that:
For the year 1999, the email exchange is low. We find the same rate in April 2002.
Over the year 2000, the number of emails exchanged between Enron’s workers increased gradually, reaching its highest level in November 2000.
In the year 2001, we see a peak of email exchanges during April and May. This period may correspond to the moment when the fiscal fraud began to be discovered. Then, the number of exchanges decreased during the summer, only to peak again in October, which is also the period when the company was under SEC investigation.
The email exchanges stopped in May 2002, possibly the date when the company was completely closed. At the start of 2002 (in January and February), we still see a high number of email exchanges. This may be due to the completion of the fiscal fraud investigation and its consequences for the company.
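The monthly counts behind such a plot can be checked by hand. The base R sketch below, run on a few invented dates, counts the records per year-month and returns the busiest month; the same check could be applied to the `date` column of the real data frame.

```r
# Invented dates, for illustration only
dates <- as.Date(c("2001-04-03", "2001-04-20", "2001-05-01",
                   "2001-10-15", "2001-10-16", "2001-10-30"))

# Number of records per year-month
monthly_counts <- table(format(dates, "%Y-%m"))
monthly_counts

# Month with the highest number of records
busiest_month <- names(which.max(monthly_counts))
busiest_month
```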
First of all, in the df_message_status table, we count the distinct email addresses of the senders and of the recipients:
#count the number of distinct sender email addresses
sender_count <- df_message_status %>% select(sender) %>% #keep only the variable we need
distinct(sender) %>% #keep only once each email address
count() #count them
#count the number of distinct recipient email addresses
recipient_count <- df_message_status %>% select(recipient) %>% distinct(recipient) %>% count()
In the df_message_status table, there are 67571 different email addresses for the recipients and 17501 different email addresses for the senders. The large difference between them suggests that one email is often addressed to several people.
To picture which type of Enron worker is the most active in the email exchanges, we look at the number of emails sent and received by each status and then compare them.
We start with the emails sent.
#compute the number of emails sent per day per employee status
violin_worker <- df_message_status %>% filter(!is.na(status_sender)) %>%
group_by(date, status_sender) %>%
summarise(email_count = n(), .groups = "drop")
#violin plot
ggplot(violin_worker, aes(as.factor(status_sender), email_count, fill = as.factor(status_sender))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
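#zoom on 0-250 without dropping the days above 250 emails from the violins, boxplots and ANOVA (ylim() would remove them)
coord_cartesian(ylim = c(0, 250))+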
stat_compare_means(method = "anova", label.y = 250, size = 4)+
labs(title = "Number of emails sent based on the status",
x = "Source",
y = "Number of emails") +
theme(legend.position = "none")
The above plot shows that employees send the highest number of emails in the company. The ANOVA test indicates that the difference between the groups is significant.
Table with the descriptive statistics for each group:
#descriptive statistics between the worker status group
violin_worker %>% group_by(status_sender)%>%
summarise(
mean = mean(email_count),
sd = sd(email_count),
min = min(email_count),
Q1 = quantile(email_count, 0.25),
Q3 = quantile(email_count, 0.75),
max = max(email_count)
)
## # A tibble: 9 × 7
## status_sender mean sd min Q1 Q3 max
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 CEO 37.7 284. 1 3 17 4740
## 2 Director 27.7 41.4 1 3 39 298
## 3 Employee 159. 271. 1 13 186. 4085
## 4 In House Lawyer 7.29 7.12 1 2 9 35
## 5 Manager 47.9 69.0 1 11 62 1044
## 6 Managing Director 10.7 32.2 1 2 8 455
## 7 President 29.6 75.5 1 3 26 988
## 8 Trader 17.6 24.0 1 4 23 307
## 9 Vice President 74.5 116. 1 12 89.8 1014
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_sender,
#adjust the p.value with bonferroni because the number of group is small
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: violin_worker$email_count and violin_worker$status_sender
##
## CEO Director Employee In House Lawyer Manager
## Director 1.000 - - - -
## Employee < 2e-16 < 2e-16 - - -
## In House Lawyer 1.000 1.000 < 2e-16 - -
## Manager 1.000 1.000 < 2e-16 0.154 -
## Managing Director 1.000 1.000 < 2e-16 1.000 0.017
## President 1.000 1.000 < 2e-16 1.000 1.000
## Trader 1.000 1.000 < 2e-16 1.000 0.032
## Vice President 0.022 7.0e-05 < 2e-16 5.7e-05 0.047
## Managing Director President Trader
## Director - - -
## Employee - - -
## In House Lawyer - - -
## Manager - - -
## Managing Director - - -
## President 1.000 - -
## Trader 1.000 1.000 -
## Vice President 2.5e-08 8.3e-05 2.9e-09
##
## P value adjustment method: bonferroni
The tables above describe the number of emails sent per day for each status and compare the groups pairwise. They confirm the first observations from the violin plot:
Employees are the group that sends the highest number of emails per day on average. Employees are also the largest group of workers in the company, which may influence this result.
After them, vice presidents and managers send the highest number of emails per day. This may be related to their roles in the company.
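The ANOVA and the pooled-SD t tests assume roughly normal distributions with similar variances in each group, whereas the daily counts above are strongly right-skewed (for example, the standard deviation of the CEO group is far larger than its mean). One way to check that the conclusions do not depend on these assumptions is a rank-based test. The sketch below is only an illustration on simulated counts, not a re-analysis of the Enron data; the group means are chosen arbitrarily.

```r
set.seed(42)

# Simulated, right-skewed daily counts for three statuses (not the Enron data)
toy <- data.frame(
  status = rep(c("Employee", "Manager", "Trader"), each = 100),
  email_count = c(rnbinom(100, mu = 150, size = 1),
                  rnbinom(100, mu = 50, size = 1),
                  rnbinom(100, mu = 15, size = 1))
)

# Global rank-based comparison (non-parametric counterpart of the one-way ANOVA)
kw <- kruskal.test(email_count ~ status, data = toy)
kw$p.value

# Pairwise rank-based comparisons with the same Bonferroni adjustment as above
pw <- pairwise.wilcox.test(toy$email_count, toy$status,
                           p.adjust.method = "bonferroni", exact = FALSE)
pw$p.value
```

On the real data, the same calls would take `violin_worker$email_count` and `violin_worker$status_sender` as inputs.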
Previously, we pointed out that employees are the largest group at Enron. To check whether they really are the most active group in terms of email sending, we normalize the number of emails sent per day for each group by the number of distinct senders in that group on that day.
#Filter to keep only the senders with a known status
df_message_status %>% filter(!is.na(status_sender)) %>%
group_by(date, status_sender) %>%
#count the number of emails sent per day per group as well as the number of distinct senders in each group at this date
mutate(
nb_send = n(),#for each status, total number of emails sent at a date
nb_sender_per_gp = n_distinct(sender) #for each status, number of distinct sender email addresses at a date
) %>% ungroup()%>%
#ratio between the emails sent per day for each status and the number of distinct senders in that status for that day
mutate(ratio_nb_email = nb_send/nb_sender_per_gp) %>%
#violin box plot
ggplot(aes(status_sender, ratio_nb_email, fill = status_sender)) +
geom_violin(trim = FALSE)+
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
stat_compare_means(method = "anova", label.y = 2500, size = 4)+
labs(title = "Number of emails sent based on the status",
subtitle = "Ratio to the number of workers per group.",
x = "Source",
y = "Ratio\n(number of emails per status/number of workers per status)")+
theme(legend.position = "none")
When we normalize the number of emails sent per day, the values are generally small compared with the scale of the plot, probably between 0 and 10 for the first quartile. Surprisingly, the CEO group shows the highest average number of emails per day, which contradicts our previous observation based on the raw number of emails sent per day and per status. The violin plot suggests a large gap between the lowest and the highest daily values for this group, so the average might be pulled up by a few extreme values.
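To make the normalization concrete, the sketch below computes the ratio (emails sent per day and per status, divided by the number of distinct senders of that status on that day) on a tiny invented data frame. The column names follow those used above; the rows are illustrative only.

```r
library(dplyr)

# Toy data (invented): one row per email sent
toy <- tibble(
  date = as.Date(c(rep("2001-08-23", 5), rep("2001-08-24", 3))),
  status_sender = c("CEO", "CEO", "CEO", "CEO", "Employee", "Employee", "Employee", "CEO"),
  sender = c("a", "a", "a", "b", "c", "c", "d", "a")
)

# One row per day and status: emails sent, distinct senders, and their ratio
per_day <- toy %>%
  group_by(date, status_sender) %>%
  summarise(nb_send = n(),
            nb_sender_per_gp = n_distinct(sender),
            .groups = "drop") %>%
  mutate(ratio_nb_email = nb_send / nb_sender_per_gp)

per_day
```

On 2001-08-23 the toy CEO group sends four emails from two addresses, which gives a ratio of 2, while the other day/status pairs have a ratio of 1. Using `summarise()` keeps one row per day and status, so each day weighs the same in a later summary.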
#Description of the emails sent for each status
df_message_status %>% filter(!is.na(status_sender)) %>%
group_by(date, status_sender) %>% mutate(
#count the number of emails sent in each group
nb_send = n(),
#count the number of distinct senders in each group
nb_sender_per_gp = n_distinct(sender)) %>%
ungroup()%>%
#make the ratio
mutate(ratio_nb_email = nb_send/nb_sender_per_gp) %>%
distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email) %>%
group_by(status_sender)%>%
#description of the emails sent, normalized by the number of distinct senders in each status
summarise(
mean = mean(ratio_nb_email),
median = median(ratio_nb_email),
sd = sd(ratio_nb_email),
min = min(ratio_nb_email),
Q1 = quantile(ratio_nb_email, 0.25),
Q3 = quantile(ratio_nb_email, 0.75),
max = max(ratio_nb_email)
)
## # A tibble: 9 × 8
## status_sender mean median sd min Q1 Q3 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO 32.0 7 189. 1 3 15 2370
## 2 Director 12.0 7 15.6 1 3 14.5 194
## 3 Employee 23.7 16.1 25.9 1 10.7 25.7 348
## 4 In House Lawyer 7.29 5 7.12 1 2 9 35
## 5 Manager 11.3 8.43 13.0 1 5.17 13.2 201.
## 6 Managing Director 9.96 3.5 25.7 1 2 7.5 228.
## 7 President 20.3 9 59.0 1 3 18 988
## 8 Trader 7.59 5 8.03 1 2.67 9.12 81
## 9 Vice President 15.3 11.2 14.9 1 6.8 18.3 206
After normalizing the number of emails sent by the number of senders in the group, we can see that the average for the CEO is around 32 emails per day with a median of 7, while the average for the employees is around 23 with a median of 16, suggesting that the average for the CEO is pushed higher by some extreme values. Indeed, the maximum for the CEO is 2,370 and for the employees it is 348. This could be the reason why the CEO appears to send a higher number of emails per day. To understand this extreme value, we look for the date linked to it.
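The effect of a single extreme day on the mean, compared with the median, can be seen on a small invented vector. The values below are illustrative; only the extreme value 2370 echoes the maximum reported in the table above.

```r
# Invented daily ratios with one extreme day
daily_ratio <- c(3, 5, 7, 7, 9, 15, 2370)

mean_all <- mean(daily_ratio)          # pulled up by the extreme day
median_all <- median(daily_ratio)      # unaffected by the extreme day
mean_without <- mean(daily_ratio[-7])  # mean once the extreme day is left out

c(mean_all = mean_all, median_all = median_all, mean_without = mean_without)
```

This is why the median is the safer summary when comparing statuses whose distributions contain a few very large values.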
We look closely at the CEO group and isolate the rows that match the maximum ratio.
df_message_status %>% filter(!is.na(status_sender)) %>%
group_by(date, status_sender) %>% mutate(
nb_send = n(),
nb_sender_per_gp = n_distinct(sender)) %>% ungroup()%>%
mutate(ratio_nb_email_pctg = nb_send/nb_sender_per_gp) %>%
distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email_pctg) %>%
#look specifically at the CEO status and at the maximum ratio (compared as a number, not as a string)
filter((status_sender == "CEO") & (ratio_nb_email_pctg == 2370))
## # A tibble: 2 × 6
## date status_sender sender nb_send nb_sender_per_gp ratio_nb_email_pctg
## <date> <fct> <chr> <int> <int> <dbl>
## 1 2001-08-23 CEO kenneth… 4740 2 2370
## 2 2001-08-23 CEO david.w… 4740 2 2370
Indeed, the maximum number of emails sent by the CEO group was reached on 2001-08-23 (4740 emails from two senders), in the period when the heads of the company started to worry that the accounting practices could be discovered by the fiscal authorities.
#environment cleaning
rm(jeff_stat, sender_stat, statuts_stat, p1, p2, p3, p4, violin_plot, violin_plot1, violin_plot2, violin_worker)
Now we look at the emails received by each Enron worker status.
#compute the number of emails received per day per employee status
violin_worker <- df_message_status %>% filter(!is.na(status_recipient)) %>%
group_by(date, status_recipient) %>%
summarise(email_count = n(), .groups = "drop")
#violin plot
ggplot(violin_worker, aes(as.factor(status_recipient), email_count, fill = as.factor(status_recipient))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
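#zoom on 0-250 without dropping the days above 250 emails from the violins, boxplots and ANOVA (ylim() would remove them)
coord_cartesian(ylim = c(0, 250))+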
stat_compare_means(method = "anova", label.y = 250, size = 4)+
labs(title = "Number of emails received based on the status",
x = "Source",
y = "Number of emails") +
theme(legend.position = "none")
Employees, managers, and vice presidents seem to be the groups that receive the highest number of emails at Enron. In-house lawyers seem to receive the fewest emails per day. The difference between groups is significant.
Descriptive statistics and comparison between groups:
#description of the email received by each status
violin_worker %>% group_by(status_recipient)%>%
summarise(
mean = mean(email_count),
median = median(email_count),
sd = sd(email_count),
min = min(email_count),
Q1 = quantile(email_count, 0.25),
Q3 = quantile(email_count, 0.75),
max = max(email_count)
)
## # A tibble: 9 × 8
## status_recipient mean median sd min Q1 Q3 max
## <fct> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 CEO 11.6 6 15.3 1 2 15 197
## 2 Director 35.6 18 61.7 1 5 38 676
## 3 Employee 98.6 40 156. 1 7 122. 1333
## 4 In House Lawyer 5.64 3 8.14 1 1 6.5 62
## 5 Manager 42.2 28 53.1 1 10 55 438
## 6 Managing Director 18.0 6 30.4 1 2 18 178
## 7 President 22.9 10 32.4 1 3 29 224
## 8 Trader 39.8 12 70.6 1 3 42 538
## 9 Vice President 85.8 32 130. 1 7 122. 1140
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_recipient,
#adjust the p.value with bonferroni because the number of group is small
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: violin_worker$email_count and violin_worker$status_recipient
##
## CEO Director Employee In House Lawyer Manager
## Director 9.4e-05 - - - -
## Employee < 2e-16 < 2e-16 - - -
## In House Lawyer 1.00000 5.9e-05 < 2e-16 - -
## Manager 2.4e-08 1.00000 < 2e-16 8.7e-08 -
## Managing Director 1.00000 0.01940 < 2e-16 1.00000 3.3e-05
## President 0.86132 0.35860 < 2e-16 0.18185 0.00190
## Trader 9.8e-07 1.00000 < 2e-16 1.5e-06 1.00000
## Vice President < 2e-16 < 2e-16 0.06459 < 2e-16 < 2e-16
## Managing Director President Trader
## Director - - -
## Employee - - -
## In House Lawyer - - -
## Manager - - -
## Managing Director - - -
## President 1.00000 - -
## Trader 0.00058 0.02020 -
## Vice President < 2e-16 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
Again, employees receive the highest number of emails per day. Their mean is the highest and is close to that of the vice presidents. In addition, the standard deviation of these two groups is large, so their distributions overlap. This explains why the number of emails received per day by the employee group isn’t significantly higher than that of the vice president group (p = 0.065). The employee group is the largest in the company (27% of the workforce), while the vice presidents represent only about 15% of the workforce. Perhaps the vice presidents receive a high number of emails because of their position in the company. The manager group is also among the groups that receive the most emails per day, perhaps for the same reason. After these groups come the traders and directors, who also receive a high number of emails per day.
As for the emails sent, we check whether these results hold when we normalize the number of emails received per day for each group by the number of recipients in that group.
#Filter to keep only the recipients with a known status
df_message_status %>% filter(!is.na(status_recipient)) %>%
#group by recipient status, since we count the emails received
group_by(date, status_recipient) %>%
#count the number of emails received per day per group as well as the number of distinct recipients in each group at this date
mutate(nb_received = n(),
nb_received_per_gp = n_distinct(recipient)) %>%
ungroup()%>%
#ratio between the emails received per day for each group and the number of distinct recipients in that group for that day
mutate(ratio_nb_email = nb_received/nb_received_per_gp) %>%
#violin box plot
ggplot(aes(status_recipient, ratio_nb_email, fill = status_recipient)) +
geom_violin(trim = FALSE)+
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
stat_compare_means(method = "anova", label.y = 70, size = 4)+
labs(title = "Number of emails received based on the status",
subtitle = "Ratio to the number of workers per group.",
x = "Source",
y = "Ratio\n(number of emails per status/number of workers per status)")+
theme(legend.position = "none")
#Description of the emails received by each status, normalized by the number of distinct recipients per status
df_message_status %>% filter(!is.na(status_recipient)) %>%
#group by recipient status; the table printed below comes from a run grouped by status_sender and will change once this is re-run
group_by(date, status_recipient) %>%
mutate(nb_received = n(),
nb_received_per_gp = n_distinct(recipient)) %>%
ungroup()%>%
mutate(ratio_nb_email = nb_received/nb_received_per_gp)%>%
#keep only distinct values
distinct(date,status_recipient, recipient, nb_received, nb_received_per_gp, ratio_nb_email) %>%
#make the descriptive statistics for each recipient group
group_by(status_recipient)%>% summarise(
mean = mean(ratio_nb_email),
median = median(ratio_nb_email),
sd = sd(ratio_nb_email),
min = min(ratio_nb_email),
Q1 = quantile(ratio_nb_email, 0.25),
Q3 = quantile(ratio_nb_email, 0.75),
max = max(ratio_nb_email)
)
## # A tibble: 9 × 8
## status_recipient mean median sd min Q1 Q3 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO 5.69 4.35 5.02 1 2.86 6.52 67.8
## 2 Director 6.54 4.81 5.70 1 3.33 7.83 48.7
## 3 Employee 6.19 4.56 5.60 1 3.04 7.13 67.8
## 4 In House Lawyer 6.99 5.32 5.88 1 3.71 8.25 40.9
## 5 Manager 6.26 4.76 5.38 1 3.2 7.29 67.8
## 6 Managing Director 6.35 4.46 6.32 1 2.74 7.28 67.8
## 7 President 5.34 4.12 4.79 1 2.48 6.25 56.1
## 8 Trader 7.17 5.27 6.55 1 3.51 8.43 67.8
## 9 Vice President 5.53 4.17 4.99 1 2.67 6.41 67.8
When we normalize the number of emails received by the number of recipients in each group, the difference between statuses remains significant. However, the contrast between groups is much weaker than for the emails sent. One explanation could be that more workers receive emails than send them each day. The p-value may also be significant simply because the large number of emails increases the statistical power, which makes it easier to reach significance even for small differences.
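A significant p-value obtained on a very large sample says little about the size of the difference. One common complement is an effect size such as eta squared, the share of the total variance explained by the grouping. The base R sketch below uses simulated data with deliberately small differences between three hypothetical groups; it illustrates the idea and does not reproduce the Enron results.

```r
set.seed(1)
n <- 20000  # observations per group, large on purpose

# Simulated ratios: group means differ only slightly (6.0, 6.2, 6.4) with the same spread
toy <- data.frame(
  status = rep(c("A", "B", "C"), each = n),
  ratio = c(rnorm(n, mean = 6.0, sd = 5),
            rnorm(n, mean = 6.2, sd = 5),
            rnorm(n, mean = 6.4, sd = 5))
)

fit <- aov(ratio ~ status, data = toy)
tab <- summary(fit)[[1]]

# p-value of the one-way ANOVA
p_value <- tab[["Pr(>F)"]][1]

# Eta squared: between-group sum of squares divided by the total sum of squares
eta_squared <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])

c(p_value = p_value, eta_squared = eta_squared)
```

Here the test is clearly significant while the grouping explains well under 1% of the variance, which is the situation described in the paragraph above.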
#count the number of distinct senders and recipients per day and per status
send_vs_received <- df_message_status %>%
group_by(date, status_sender) %>%
mutate(nb_sender_per_group = n_distinct(sender)) %>% ungroup()%>%
group_by(date, status_recipient) %>%
mutate(nb_recipient_per_group = n_distinct(recipient)) %>% ungroup()
send_vs_received <- as.data.frame(send_vs_received)
#descriptive statistic for both the sender and recipient
send_vs_received %>%
summarise(
across(c(nb_sender_per_group,nb_recipient_per_group),
list(mean = ~mean(.x),
median = ~median(.x),
sd = ~sd(.x),
min = ~min(.x),
Q1 = ~quantile(.x,0.25),
Q3 = ~quantile(.x,0.75),
max = ~max(.x))))
## nb_sender_per_group_mean nb_sender_per_group_median nb_sender_per_group_sd
## 1 206.8242 159 185.2247
## nb_sender_per_group_min nb_sender_per_group_Q1 nb_sender_per_group_Q3
## 1 1 80 281
## nb_sender_per_group_max nb_recipient_per_group_mean
## 1 1328 1248.807
## nb_recipient_per_group_median nb_recipient_per_group_sd
## 1 1168 848.2686
## nb_recipient_per_group_min nb_recipient_per_group_Q1
## 1 1 618
## nb_recipient_per_group_Q3 nb_recipient_per_group_max
## 1 1930 3145
#violin plots to visualise the descriptive statistics
p1 <- send_vs_received %>% filter(!is.na(status_sender)) %>%
ggplot(aes(status_sender, nb_sender_per_group, fill = status_sender))+
geom_violin(trim = FALSE)+
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
labs(title = "Number of persons who sent email per status",
x = "Source",
y = "Number of persons")+
theme(legend.position = "none")
p2 <- send_vs_received %>% filter(!is.na(status_recipient)) %>%
ggplot(aes(status_recipient, nb_recipient_per_group, fill = status_recipient))+
geom_violin(trim = FALSE)+
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
labs(title = "Number of persons who received email per status",
x = "Source",
y = "Number of persons")+
theme(legend.position = "none")
p1/p2
We can see that, on average, more people in a group receive emails each day than send them. This is especially true for the employee, trader, vice president, and director groups.
In general, employees are the most active in email exchanges. When we normalize the number of emails sent by the number of workers per group, employees remain the most active in sending emails, but at some point the CEO group sent a high number of emails due to Enron’s events. If we look at the number of emails received in relation to the number of workers in a group, we see no real difference between the groups, suggesting that more people receive emails each day than send them.
Next, we look at the flow of emails between the different statuses over the study period to see whether it changes.
For each year, we draw a chord diagram, which shows the links between the groups of Enron workers with a known status.
#for each year, follow the exchanges between groups
per_year <- df_message_status %>% select(date, status_sender, status_recipient) %>%
filter(!is.na(status_sender) & !is.na(status_recipient)) %>%
mutate(year = format(date,"%Y"),
#to enhance the clarity we group statuses with a similar level of responsibility together
status_sender = case_when(
status_sender %in% c("Managing Director", "Manager", "Director") ~ "Manager - Director",
status_sender %in% c("CEO", "Vice President", "President") ~ "CEO - President",
.default = status_sender),
status_recipient = case_when(
status_recipient %in% c("Managing Director", "Manager", "Director") ~ "Manager - Director",
status_recipient %in% c("CEO", "Vice President", "President") ~ "CEO - President",
.default = status_recipient)) %>%
group_by(date,status_sender, status_recipient) %>%
#count the number of emails exchanged for each sender/recipient status pair per date
mutate(number_exchange = n()) %>% ungroup() %>%
distinct(date, status_sender, status_recipient, number_exchange, year)
#For each year we create a data frame with the number of emails exchanged between each pair of statuses
#the same steps apply to every year, so they are wrapped in a helper function
exchange_per_year <- function(selected_year) {
  as.data.frame(per_year %>% filter(year == selected_year) %>%
    group_by(status_sender, status_recipient) %>%
    #sum for each pair over the year
    mutate(sum = sum(number_exchange)) %>% ungroup() %>%
    distinct(status_sender, status_recipient, sum) %>%
    #keep only the exchanges between different statuses
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
  )
}
year_1999 <- exchange_per_year(1999)
year_2000 <- exchange_per_year(2000)
year_2001 <- exchange_per_year(2001)
year_2002 <- exchange_per_year(2002)
#the color for each status
status_color <- c(
"Employee" = "pink",
"CEO - President" = "orange",
"Trader" = "springgreen3",
"Manager - Director" = "violetred4",
"In House Lawyer" = "purple4")
Chord diagram for the year 1999:
#the data frame (sender, recipient, sum) is passed directly so that each link is weighted by the number of emails
#table() on these columns would only record whether a pair exists, since each pair appears once
chordDiagram(year_1999, transparency = 0.5, grid.col = status_color)
Year 2000:
chordDiagram(year_2000, transparency = 0.5, grid.col = status_color)
Year 2001:
chordDiagram(year_2001, transparency = 0.5, grid.col = status_color)
Year 2002:
chordDiagram(year_2002, transparency = 0.5, grid.col = status_color)
For the email exchange, we can see that:
In 1999, the trader exchanged emails only with employees, but later, they also exchanged with managers/directors and the CEO/president. Surprisingly, it seems the trader never exchanged directly with the in-house lawyer. Perhaps their email exchanges were indirect.
In 2002, the in-house lawyer received emails only from the manager/director. During this period, we do not see email exchanges from the in-house lawyer to other company workers with a known status. Perhaps they sent emails to external persons for managing the company’s bankruptcy with the information they received from the manager and director.
The in-house lawyer exchanged emails in 2000 only with the manager/director and the CEO/president, but in 2001, they also exchanged with employees. The change in the email flow for the in-house lawyer might be related to the Enron event, where there could have been a need to inform employees about some matters so they could respond to SEC investigations.
This last analysis highlights how the email flow changed over the study period. Some of these changes could be linked to the Enron event.
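One caveat when reading these diagrams: `table()` counts rows, and each yearly data frame holds one row per sender/recipient status pair, so every existing link gets a weight of 1. The chords therefore show whether two groups exchanged emails, not how many. If volumes are of interest, a weighted matrix could be built with `xtabs()`. The sketch below uses a small invented data frame (`toy_year` is hypothetical, with the same three columns as `year_1999`).

```r
# Toy stand-in for year_1999: one row per (sender status, recipient status) pair.
# The values are invented for illustration only.
toy_year <- data.frame(
  status_sender    = c("Employee", "Trader", "Employee"),
  status_recipient = c("Trader", "Employee", "In House Lawyer"),
  sum              = c(120, 45, 8)
)

# table() counts rows: every existing pair gets a weight of 1
presence <- with(toy_year, table(status_sender, status_recipient))

# xtabs() sums the `sum` column: every pair is weighted by its email volume
volume <- xtabs(sum ~ status_sender + status_recipient, data = toy_year)

presence["Employee", "Trader"]
volume["Employee", "Trader"]
```

The weighted matrix has the same shape as the presence matrix, so it should be usable in `chordDiagram()` in the same way as `adjacencyData_99`.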
The data set covers the email exchanges between Enron’s workers from 1999 to 2002. From 1999 to early 2001, the company was in good health. Starting in the middle of 2001, the company’s fraud became public and put the company in trouble. Through the email history, we will look at whether the number of emails sent and received changed over the months in relation to the workers’ status.
For each month of each year, we look at which worker statuses are the most active, starting with the emails sent.
#list of status in the Enron company
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")
month_label <- c("01" = "January","02" = "February","03" = "March","04" = "April","05" = "May","06" = "June","07" = "July","08" = "August",
"09" = "September","10" = "October","11" = "November","12" = "December")
month_color <- c("01" = "lightgreen","02" = "lightsalmon4","03" = "lightblue","04" = "greenyellow","05" = "cyan","06" = "darkgreen","07" = "lavender",
"08" = "plum","09" = "coral","10" = "honeydew4","11" = "hotpink","12" = "indianred")
#initiate the list for the plot
email_send <- list()
#loop building, for each worker status, a bar plot of the number of emails sent per month
for(i in seq_along(status_list)){
status <- status_list[i]
p <- df_message_status %>% filter(status_sender == status) %>% #take the value in the list
group_by(year,month)%>%
count() %>%
#bar plot
ggplot(aes(month, n, fill = month))+
geom_bar(stat = "identity") +
facet_grid(~year)+
labs(title = paste("Email sent per month for each year by the", status),
y = "Number of emails")+
scale_fill_manual(
values = month_color,
labels = month_label)+
theme(legend.position = "bottom",
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
email_send[[i]] <- p}
#display the plot create
n <- length(email_send)
plot_per_section <- 3
for(j in seq(1,n,by=plot_per_section)){
plot_on_the_page <- email_send[j:min(j+plot_per_section-1, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of the layout (up to 3) on 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine the grid of plots with the shared legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
By looking year by year we can see that:
It is the workers with employee status who send the highest number of emails in the different years. The number of emails sent follows the trend we observed when we look at all Enron’s workers, suggesting that the employees influence the general email exchange number per month in the company. This could be linked to the number of employees in the company. In 2001, the employee group was the one who sent the highest number of emails.
The CEO appears in the emails sent from January 2000, which is the moment their role is formally declared in the company. They send a high number of emails compared to directors and managing directors, especially in 2001: in April, May, October, and November, they send a large number of emails. This may be related to the fiscal fraud investigation.
In the year 2001, the number of emails sent by the in-house lawyer is the highest compared to the other years, suggesting they are involved in managing the fiscal fraud investigation inside the company.
The traders are the third group who send a high number of emails per month, which is logical given the company’s activity.
Now we look at the emails received according to the Enron workers' status.
#initiate the list for the plot
email_received <- list()
#loop building, for each worker status, a bar plot of the number of emails received per month
for(i in seq_along(status_list)){
status <- status_list[i]
p <- df_message_status %>% filter(status_recipient == status) %>% #take the value in the list
group_by(year,month)%>%
count() %>%
#bar plot
ggplot(aes(month, n, fill = month))+
geom_bar(stat = "identity") +
facet_grid(~year)+
labs(title = paste("Email received per month for each year by the", status),
y = "Number of emails")+
scale_fill_manual(
values = month_color,
labels = month_label)+
theme(legend.position = "bottom",
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
email_received[[i]] <- p}
#display the plot create
n <- length(email_received)
plot_per_section <- 3
for(j in seq(1,n,by=plot_per_section)){
plot_on_the_page <- email_received[j:min(j+plot_per_section-1, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of the layout (up to 3) on 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine the grid of plots with the shared legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
The plot above shows that:
As for the emails sent, it is the employees who receive the highest number. They follow the same trend as we saw for the emails sent, suggesting they are active in email exchanges in general.
The traders seem to receive more emails than they send.
For the group at the head of the company (CEO, Managing Director, Director, President, and Vice President), the number of emails received follows the timeline of Enron’s fiscal fraud event, with high peaks in April, May, October, and November of 2001.
In 2001, the Vice President group received a lot of emails compared to the other head groups of the company.
In 2001, the in-house lawyer group seemed to receive the highest number of emails.
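The impression that some groups receive more than they send can be checked by putting both counts side by side. Below is a minimal sketch, assuming a data frame shaped like `df_message_status` with `status_sender` and `status_recipient` columns; the `toy` data is invented for illustration.

```r
# Invented rows: one row per (sender, recipient) exchange
toy <- data.frame(
  status_sender    = c("Employee", "Employee", "Trader", "CEO", "Employee"),
  status_recipient = c("Trader", "Trader", "Employee", "Trader", "CEO")
)

# All statuses seen on either side, so both tables share the same levels
all_status <- sort(unique(c(toy$status_sender, toy$status_recipient)))

sent     <- table(factor(toy$status_sender,    levels = all_status))
received <- table(factor(toy$status_recipient, levels = all_status))

balance <- data.frame(
  status   = all_status,
  sent     = as.integer(sent),
  received = as.integer(received)
)

# A ratio above 1 means the group appears more often as recipient than as sender
balance$received_per_sent <- balance$received / balance$sent
balance
```

Applied to the real data, the same table could also be computed per year to see whether the balance of a group shifts in 2001.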
#environment cleaning
rm(jeff_stat, recipient_stat, statuts_stat, violin_plot, violin_plot1, violin_plot2, violin_worker, p1, p2, send_vs_received)
Now we try to see who is the most active in the email exchange. For that, we start by counting the number of emails sent by each worker and return the 10 persons who sent the highest number.
#Display the top 10 email address of sender
p1 <- df_message_status %>%
#keep distinct exchange
distinct(sender, subject, recipient, .keep_all = TRUE) %>%
group_by(sender)%>% count() %>% #to count the number of email send per email address
ungroup() %>%
#calculate the percentage for each sender
mutate(perc = round(`n`/sum(`n`),3),
labels = scales::percent(perc)) %>%
arrange(desc(n)) %>% head(10) %>% #to get only the 10 email address with the most important number of email send
#bar chart
ggplot(aes(reorder(sender, perc, sum), perc, fill = sender)) +
geom_bar(stat="identity") +
coord_flip() +
#graph title and label
geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
scale_y_continuous(labels = scales::percent_format())+
labs(title = "Top 10 Enron's employee email sender")+
xlab("Employee's email address")+
ylab("Email sent per sender (%)") +
scale_fill_brewer(palette = "Set3")+
theme(legend.position = "none",
plot.margin = margin(10, 10, 10, 20))
#Display the top 10 email address of recipient
p2 <- df_message_status %>% filter(rtype == "TO") %>% #select only the email of the direct concerned receiver
distinct(sender, recipient, subject, .keep_all = TRUE) %>%
group_by(recipient)%>% count() %>% #to count the number of emails received per email address
ungroup() %>%
#calculate the percentage for each recipient
mutate(perc = round(`n`/sum(`n`),4),
labels = scales::percent(perc)) %>%
arrange(desc(n)) %>% head(10) %>% #to keep only the 10 email addresses that received the most emails
#bar chart
ggplot(aes(reorder(recipient, perc, sum), perc, fill = recipient)) +
geom_bar(stat="identity") +
coord_flip() +
#graph title and label
geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
scale_y_continuous(labels = scales::percent_format())+
labs(title = "Top 10 Enron's employee email receiver",
subtitle = "Only principal receiver")+
xlab("Employee's email address")+
ylab("Email received per recipient (%)") +
scale_fill_brewer(palette = "Set3")+
theme(legend.position = "none",
plot.margin = margin(10, 10, 10, 20))
#arrange the plot on the same place
p1 / p2
Jeff Dasovich appears to be the most active Enron worker in the email exchange: over the study period, he sent the highest proportion of emails (3.1%) and received the highest proportion (0.61%).
#return only one result from that query to get the status of the most active sender/recipient
head(df_message_status[df_message_status$sender == "jeff.dasovich@enron.com", "status_sender"],
n=1)
## [1] Employee
## 10 Levels: CEO Director Employee In House Lawyer Manager ... Vice President
In the employee data set he is described as an Employee of Enron. To see whether he is really the most active, we will compare the number of emails he sent and received to those of the other workers with the same status (Employee) and to those of all the workers of the Enron company.
For that, we will compute comparative descriptive statistics between them.
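Because `df_message_status` appears to hold one row per sender/recipient pair, a message sent to many recipients occupies many rows. The sketch below shows one possible way to count distinct messages per sender and per day. It assumes that a message can be identified by its date, sender, and subject, which is an assumption (the data set may offer a proper message identifier), and it runs on invented data.

```r
# Invented rows: the "Budget" email of 1 October goes to two recipients
toy <- data.frame(
  date      = as.Date(c("2001-10-01", "2001-10-01", "2001-10-01", "2001-10-02")),
  sender    = c("a@enron.com", "a@enron.com", "a@enron.com", "a@enron.com"),
  subject   = c("Budget", "Budget", "Lunch", "Budget"),
  recipient = c("b@enron.com", "c@enron.com", "b@enron.com", "b@enron.com")
)

# Drop the recipient dimension: one row per distinct message
messages <- unique(toy[, c("date", "sender", "subject")])

# Count the messages per sender and per day
per_day <- aggregate(subject ~ date + sender, data = messages, FUN = length)
names(per_day)[names(per_day) == "subject"] <- "emails_sent"
per_day
```

Counting rows without the deduplication step would instead measure how many recipient rows each message generated.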
#count the rows sent by jeff dasovich per day and subject
jeff_stat_send <- df_message_status %>% filter(sender == "jeff.dasovich@enron.com") %>%
#email_count is the number of rows of df_message_status sharing the same date and subject
group_by(date, subject) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))
#same count for all the senders
sender_stat <- df_message_status %>%
#email_count is the number of rows sharing the same date, sender and subject
group_by(date, sender, subject) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "All sender") %>% select(-sender) %>% transform(source = as.factor(source))
#same count for the senders with an Employee status
statuts_stat_send <- df_message_status %>% filter(status_sender == "Employee") %>%
#email_count is the number of rows sharing the same date, sender and subject
group_by(date, sender, subject) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "Employee status") %>% transform(source = as.factor(source))
#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_send, statuts_stat_send)
violin_plot2 <- bind_rows(jeff_stat_send, sender_stat)
#compare the 2 groups with a t-test to see whether Jeff Dasovich is really more active than the other employees
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
#display the comparative statistic on the violin plot
stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) - 400)+
labs(title = "Comparison of the email sent between
Jeff Dasovich and the Enron's Employee",
x = "Source",
y = "Number of emails") +
#break the y axis to make the violin plot easier to read
scale_y_break(c(100, 3000), scales = 0.3)+
#set up the color for each source
scale_fill_manual(values = c(
"Jeff Dasovich" = "tomato2",
"Employee status" = "yellowgreen"))+
#remove the legend from the plot
theme(legend.position = "none")
#same plot, comparing Jeff Dasovich to all the Enron workers
p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
stat_compare_means(method = "t.test", label.y = max(violin_plot2$email_count) - 2000)+
scale_y_break(c(250, 15000), scales = 0.3)+
labs(title = "Comparison of the email sent between
Jeff Dasovich and all sender",
x = "Source",
y = "Number of emails") +
scale_fill_manual(#set up the color for each source
values = c(
"Jeff Dasovich" = "tomato2",
"All sender" = "cyan"))+
theme(legend.position = "none")
#arrange the plot on the same place
p3 + p4
#display the stat of the different group
violin_plot <- bind_rows(jeff_stat_send, sender_stat, statuts_stat_send)
#Description of the email send by Jeff Dasovich, the Employee, and all
violin_plot %>% group_by(source)%>%
summarise(
mean = mean(email_count),
sd = sd(email_count),
min = min(email_count),
Q1 = quantile(email_count, 0.25),
Q3 = quantile(email_count, 0.75),
max = max(email_count)
)
## # A tibble: 3 × 7
## source mean sd min Q1 Q3 max
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich 15.6 45.7 1 1 9 760
## 2 All sender 10.6 80.6 1 1 5 18445
## 3 Employee status 5.49 29.9 1 1 3 3556
The table summarizing the emails sent by the group shows us that:
It is Jeff Dasovich who has the highest average number of emails sent (15.6 per day and subject). The lowest is for the Enron employees (5.49).
Looking at the first and third quartiles (the values below which 25% and 75% of the observations fall), it is also Jeff who has the highest third quartile (9), especially compared to the Enron employees (3).
Surprisingly, the highest count for a single day (18445) is found among all the senders rather than for Jeff. Maybe that is linked to the Enron event.
From this we can deduce that Jeff Dasovich is significantly more active than the other Enron workers in email sending.
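The counts compared here are heavily right-skewed (in every group the mean is above the third quartile), a situation in which a t-test can be fragile. A rank-based test is a common complement. The sketch below uses invented numbers, not the Enron data, only to show the calls; with `stat_compare_means()`, the corresponding option should be `method = "wilcox.test"`.

```r
# Invented, right-skewed counts for two groups (illustration only)
group_a <- c(1, 1, 2, 4, 9, 15, 40, 760)
group_b <- c(1, 1, 1, 1, 2, 3, 3, 5)

# Welch t-test: compares the means, so it is sensitive to the extreme value
t_res <- t.test(group_a, group_b)

# Wilcoxon rank-sum test: compares ranks, so the extreme value only counts
# as "the largest"; exact = FALSE avoids the warning caused by tied values
w_res <- wilcox.test(group_a, group_b, exact = FALSE)

c(t_test = t_res$p.value, wilcoxon = w_res$p.value)
```

If both tests point in the same direction on the real data, the conclusion does not depend on the few extreme days.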
Then we look at the emails received by Jeff Dasovich compared to the Enron workers of the same status and to all Enron workers.
#number of emails received per day by jeff dasovich
jeff_stat_rec <- df_message_status %>% filter(recipient == "jeff.dasovich@enron.com") %>%
group_by(date) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))
#number of emails received per day by each recipient
recipient_stat <- df_message_status %>% group_by(date, recipient) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "Enron's worker") %>% select(-recipient) %>% transform(source = as.factor(source))
#number of emails received per day by all the workers with an Employee status taken together (grouped by date only, not by recipient)
statuts_stat_rec <- df_message_status %>% filter(status_recipient == "Employee") %>% group_by(date) %>%
summarise(email_count = n(), .groups = "drop") %>%
mutate(source = "Employee status") %>% transform(source = as.factor(source))
#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_rec, statuts_stat_rec)
violin_plot2 <- bind_rows(jeff_stat_rec, recipient_stat)
#compare the 2 groups with a t-test to see whether Jeff Dasovich is really more active than the other employees and/or workers of the Enron company
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
#compare the 2 groups statistically to see whether the difference is significant or not
stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) + 2)+
labs(title = "Comparison of the email received between
Jeff Dasovich and the Enron's Employee",
x = "Source",
y = "Number of emails") +
theme(legend.position = "none")+
scale_fill_manual(#set up the color for each resources
values = c(
"Jeff Dasovich" = "tomato2",
"Employee status" = "yellowgreen"
))
p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
ylim(c(-10,350))+
stat_compare_means(method = "t.test", label.y = 300)+
labs(title = "Comparison of the email received between
Jeff Dasovich and all recipient",
x = "Source",
y = "Number of emails") +
theme(legend.position = "none")+
scale_fill_manual(#set up the color for each source
values = c(
"Jeff Dasovich" = "tomato2",
#the name must match the source label defined in recipient_stat
"Enron's worker" = "cyan"
))
#arrange the plot on the same place
p3 + p4
violin_plot <- bind_rows(jeff_stat_rec, recipient_stat, statuts_stat_rec)
#Description of the email received
violin_plot %>% group_by(source) %>%
summarise(
mean = mean(email_count),
median = median(email_count),
sd = sd(email_count),
min = min(email_count),
Q1 = quantile(email_count, 0.25),
Q3 = quantile(email_count, 0.75),
max = max(email_count)
)
## # A tibble: 3 × 8
## source mean median sd min Q1 Q3 max
## <fct> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich 17.5 10 19.2 1 3 25 113
## 2 Enron's worker 3.19 2 6.36 1 1 3 1153
## 3 Employee status 98.6 40 156. 1 7 122. 1333
When we look at the number of emails received per day, Jeff Dasovich received significantly more emails on average than the Enron workers taken as a whole (17.5 versus 3.19). However, compared to the Employee status group, he did not receive more; on the contrary, he received significantly fewer (17.5 versus 98.6). For the employees, the mean (98.6) is far from the median (40), suggesting that extreme values exist within that group. The violin plot for the employees highlights this: above the 3rd quartile, there is a long tail starting around 120 and becoming extremely thin after 250. Conversely, in Jeff Dasovich’s violin plot, the tail above the 3rd quartile does not thin out but keeps a substantial number of observations at these high values. All of this suggests that some events caused the employees to receive an extremely high number of emails, a peak that is not seen for Jeff Dasovich.
From this part of the analysis we can say that:
- Jeff Dasovich is the Enron worker who sent and received the highest number of emails.
- Compared to the other workers with an employee status, he sent significantly more emails but received fewer.
- It is possible that some events made employees other than Jeff Dasovich receive more emails in one day. Jeff Dasovich may be one of the employees who receive the most emails per day, but not the only one.
We can conclude that Jeff Dasovich is more active than passive in the email exchange and is the sender with the highest number of emails per day during the study period.
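One point to keep in mind when reading the received-email comparison: `statuts_stat_rec` is grouped by `date` only, so each of its values is a daily total over all employees, whereas Jeff Dasovich's series is the daily count of a single person. A like-for-like comparison would group by recipient as well. The sketch below illustrates the difference on invented data (`toy` is hypothetical).

```r
# Invented rows: one row = one email received
toy <- data.frame(
  date      = as.Date(c("2001-10-01", "2001-10-01", "2001-10-01", "2001-10-02")),
  recipient = c("a@enron.com", "a@enron.com", "b@enron.com", "b@enron.com")
)
toy$n <- 1

# Daily total over everyone (what grouping by date alone produces)
daily_total <- aggregate(n ~ date, data = toy, FUN = sum)

# Daily count per person (comparable to one individual's series)
daily_per_person <- aggregate(n ~ date + recipient, data = toy, FUN = sum)

mean(daily_total$n)
mean(daily_per_person$n)
```

The first mean grows with the number of people in the group, while the second stays on the scale of one mailbox, which is the scale of Jeff Dasovich's counts.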
In our data set, 2063706 rows contain email content, which represents 10%. The email content is therefore much less complete than the email subject, which is available for every email exchange.
String_var_stat <- df_message_status %>% distinct(reference, subject) %>% mutate(
emailTextLength = str_count(reference,
#specify in regex we want to count the number of word or sequence of character without space between them
"\\S+"),
emailSubjectLength = str_count(subject,"\\S+"))
summary(String_var_stat)
## reference subject emailTextLength emailSubjectLength
## Length:157194 RE: : 2744 Min. : 0.0 Min. : 0.000
## Class :character FW: : 585 1st Qu.: 71.0 1st Qu.: 3.000
## Mode :character RE: Hello: 82 Median : 147.0 Median : 4.000
## RE: Hi : 56 Mean : 244.9 Mean : 4.899
## : 52 3rd Qu.: 288.0 3rd Qu.: 6.000
## RE: Lunch: 48 Max. :10153.0 Max. :49.000
## (Other) :153627 NA's :110536
On average, the email text contains 245 words and the subject about 5. We have 52 subjects that are blank, and the most frequent subjects contain only RE: or FW:, in which case the original subject is hidden. This suggests that the most common email subjects correspond to replies to another email and to emails forwarded between workers.
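The pattern `\\S+` matches each run of non-whitespace characters, so the lengths above are whitespace-delimited token counts, and a bare `RE:` counts as one word. The base R sketch below reproduces the same idea; the helper `count_tokens` is hypothetical and written only for illustration.

```r
# Count whitespace-delimited tokens in each element of a character vector
count_tokens <- function(x) {
  hits <- gregexpr("\\S+", x, perl = TRUE)
  vapply(hits, function(h) {
    # missing text stays missing; gregexpr() returns -1 when nothing matches
    if (is.na(h[1])) NA_integer_ else sum(h > 0)
  }, integer(1))
}

count_tokens(c("RE: Lunch", "RE:", "", NA))
```

The missing value in the last position corresponds to the `NA's` reported by `summary()` for the emails without text.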
To investigate the subject and text of the emails, we created 4 lists of topics which will be searched for in the email subject:
Emails related to meetings by looking for words such as message, please, email, inform.
Emails related to business processes and business legalities such as enron, deal, change, corp, date, america.
Emails related to the core business of Enron like gas, power, trade.
Emails related to the Enron event, such as bankrup, SEC, MTM, investigation.
These keywords come from the Wikipedia page on the timeline of Enron's downfall. Each word/concept will be searched for individually in the email content to follow the email exchanges containing it, as well as the status of the Enron workers involved in those exchanges. The analysis is conducted over the study period to highlight the periods where these topics/keywords are used more by the Enron workers. We will then look at whether some worker statuses used them more than others, and finally at some specific Enron workers known to be involved in the Enron events.
#topics list
topic_meeting <- c("message|origin|pleas|email|thank|attach|file|copi|inform|receiv|thank|all|time|meet|look|week|day|dont|vinc|talk")
topic_business_process <- c("enron|deal|agreement|chang|contract|corp|fax|houston|date|america|risk|analy|confidential|correction")
topic_core_business <- c("market|gas|price|power|company|energy|trade|busi|servic|manag")
topic_enron_event <- c("bankrup|SEC|MTM|fear|losing money|10-K|fears|investigation|phone|fax|document|testimony|witness|deposition")
#construction of the data set for measuring the frequency of the different topic in the email subject as well as the number of email with specific word, we focus on the sender status
email_subject_send <- df_message_status %>% distinct(date, year, month, sender, status_sender, subject, reference) %>%
mutate(#count the number of email which contain at least one word in the list of each topic
subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
#to get the date in year/month
year_month = as.Date(paste0(year,"-",month,"-01")))
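Two properties of this keyword matching are worth keeping in mind. The patterns are matched as case-sensitive substrings, so, assuming the subjects and texts keep their original capitalization, "Meeting" is not caught by `meet`, while a short keyword such as `all` is caught inside "call". The base R sketch below illustrates both effects on invented subjects; with stringr, the same options should be available through `regex(pattern, ignore_case = TRUE)` and `\\b` word boundaries.

```r
subjects <- c("Meeting tomorrow", "SEC filing", "second draft", "call me")

# Case-sensitive substring match (the default): "Meeting" is missed
meet_cs <- grepl("meet", subjects)

# Case-insensitive match finds it
meet_ci <- grepl("meet", subjects, ignore.case = TRUE)

# A short keyword also matches inside longer words ("call")
all_sub <- grepl("all", subjects)

# Word boundaries restrict the match to the standalone word
all_word <- grepl("\\ball\\b", subjects, perl = TRUE)

# An acronym such as SEC is better kept case-sensitive and bounded,
# otherwise "second" would match as well
sec_word <- grepl("\\bSEC\\b", subjects, perl = TRUE)
sec_ci   <- grepl("sec", subjects, ignore.case = TRUE)

data.frame(subjects, meet_cs, meet_ci, all_sub, all_word, sec_word, sec_ci)
```

Which variant is appropriate depends on the keyword: stems such as `bankrup` rely on substring matching, whereas acronyms and very short words benefit from boundaries.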
In the following part, we will create plots representing the email exchanges about specific topics. To give those plots a consistent appearance, we declare a color and a label for each category so that they can be applied to every plot.
#the list of category studied and their related color in each plot
topic_colors <- c("sum_subject_business_process" = "steelblue4",
"sum_subject_core_business" = "orchid",
"sum_subject_meeting" = "chocolate4",
"sum_subject_enron_event" = "yellowgreen",
"sum_email_business_process" = "cyan3",
"sum_email_core_business" = "plum4",
"sum_email_meeting" = "salmon",
"sum_email_enron_event" = "springgreen4")
#the list of category and their related label on the plot
topic_label <- c("sum_subject_business_process" = "Business process email subject",
"sum_subject_core_business" = "Core Business email subject",
"sum_subject_meeting" = "Meeting email subject",
"sum_subject_enron_event" = "Enron Event email subject",
"sum_email_business_process" = "Business process email text",
"sum_email_core_business" = "Core business email text",
"sum_email_meeting" = "Meeting email text",
"sum_email_enron_event" = "Enron's event email text")
Because the number of rows containing an email text is lower than the number of rows in the table, searching for the keywords in the email text creates many NA values. To be able to compute the sum of the emails containing those words, we use the parameter na.rm = TRUE, which drops the NA values before summing (for a sum, this is equivalent to counting them as 0).
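A minimal illustration of this behavior, on an invented vector of flags:

```r
# 1 = keyword found, 0 = not found, NA = the email has no text to search
flags <- c(1, NA, 0, 1)

# A single missing value makes the plain sum missing
sum(flags)

# With na.rm = TRUE the missing values are dropped before summing
sum(flags, na.rm = TRUE)

# Share of the emails that have a text and contain the keyword
mean(flags, na.rm = TRUE)
```

Note that the sum is then a count over the emails that have a text, so it is not directly comparable to the subject counts, which cover every email.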
#compute the sum of each topics for each month of each year study
email_subject_send_graph <- email_subject_send %>%
group_by(year_month) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per year_month, subject and reference (subject and reference are reused below for the word counts)
distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#display the different topic trend in the email subject over the study's period
email_subject_send_graph %>% select(year_month, starts_with("sum_subject_")) %>%
#change the orientation of the data set
pivot_longer(
cols = 2:5,
names_to = "topics",
values_to = "value") %>%
#line plot
ggplot(aes(year_month,value, color=topics))+
geom_line(size = 1)+
#label, axis, and legend
labs(color = "Email subject topics",
title = "Email subject analysis over the study period",
x = "Study period",
y = "Number of emails per topic") +
#to display the year and month, every 3 months for a better reading
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(#to get only the customization for the email categories
values = topic_colors[1:4],
labels = topic_label[1:4])
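Since `email_subject_send_graph` keeps `subject` and `reference`, the plotted table still has one row per email, with the monthly sums repeated on each row. If only the monthly totals are needed for a plot, collapsing to one row per month is lighter. Below is a base R sketch on invented data (`toy` mimics two of the flag columns of `email_subject_send`); in dplyr, `summarise()` would play the same role.

```r
# Invented flags: 1 if the subject matches the topic, 0 otherwise
toy <- data.frame(
  year_month          = as.Date(c("2001-10-01", "2001-10-01", "2001-11-01")),
  subject_meeting     = c(1, 1, 1),
  subject_enron_event = c(0, 1, 1)
)

# One row per month, each flag column summed within the month
monthly <- aggregate(
  cbind(subject_meeting, subject_enron_event) ~ year_month,
  data = toy,
  FUN  = sum
)
monthly
```

The resulting table can be reshaped to long format and plotted in the same way as above.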
We can see that:
The top topic is meetings; then come the business process and the core business.
For the meeting topic, we have 3 spikes:
One between October 2000 and January 2001, maybe to organize the new year and close the past year.
One between April and July 2001, which is the period when the head of the company starts to worry about the business process.
The highest peak is between October 2001 and January 2002, the period when the SEC began investigating and the company went into bankruptcy.
For the business process and core business topics, we see 2 spikes which follow the last 2 spikes of the meeting topic. This suggests that these meetings concern the business. We could think those meetings are more related to the business process than to the core business.
The emails about the Enron event are the fewest, but we can see a peak for this topic from October 2001 to around February 2002. This is consistent with the known events, as the company went into bankruptcy during this period.
For the email subjects, we look at the frequency of each word we searched for.
#the list of word research in the subject
#each word is listed only once, so that it appears only once in the word cloud
word_list <- list("message","origin","pleas","email","thank","attach","file","copi","inform","receiv","time","meet",
"look","week","dont","vinc","talk","enron","deal","agreement","chang","contract","corp","fax","houston","america",
"risk","analy","confidential","correction", "market","gas","price","power","company","energy","trade","busi","servic","manag",
"bankrup","SEC","MTM","fear", "investigation", "mark-to-market", "10-K", "losing money", "phone", "document", "testimony", "deposition", "witness")
#initiate a vector for registering their frequency
word_count <- c()
##iterate over the list and count the number of time we see each word in the list
for(i in seq_along(word_list)){
search <- as.character(word_list[[i]])
nb <- sum(str_count(email_subject_send_graph$subject, search))
word_count <- c(word_count, nb)
}
#draw a wordcloud which represent the words frequency
wordcloud(word_list, word_count, min.freq = 10 ,max.words=length(word_list), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = "The top words seen in the email subject", col.main = "black",font.main = 2)
To read the word cloud: the most frequent words are in dark blue and have the largest size, while the less frequent words are in light blue and have the smallest size. This word cloud highlights the following:
The most frequently seen word in that list is ‘meet,’ which aligns with the fact that most email subjects are in the meeting topic category.
Additionally, there are many words related to the business processes at Enron, such as deal, agreement, change, and contract.
The smaller words are linked to the Enron event, such as bankruptcy, MTM, and SEC. This suggests that the email exchanges are not explicitly about the Enron event. We may find more related content within the email bodies.
#display the different topic trend in the email text over the study's period
email_subject_send_graph %>% select(year_month, starts_with("sum_email_")) %>%
#change the orientation of the data set
pivot_longer(
cols = 2:5,
names_to = "email",
values_to = "value") %>%
#line plot
ggplot(aes(year_month,value, color=email))+
geom_line(size = 1)+
#label, axis, and legend
labs(color = "Email topics text",
title = "Email text analysis over the study period",
x = "Study period",
y = "Number of emails per topic") +
#to display the year and month, every 3 months for a better reading
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(#to get only the customization for the email categories
values = topic_colors[5:8],
labels = topic_label[5:8])
In the email content, we can see that:
For all topics investigated, we find a peak of emails containing them from April 2001 to April 2002, which is related to a peak in email exchange as we saw earlier in this analysis. Additionally, this period is when the company was under SEC investigation and, in late 2001/early 2002, the bankruptcy process.
The emails mostly contain words about meetings. Then we find words related to business processes. Surprisingly, we don’t find many emails containing words linked with the Enron event. This may suggest that the Enron events were communicated through other means such as fax and phone calls.
As for the subject, we can look at the frequency of each word in the email text:
#reduce the dataset to the row which contain email text
df_reference <- filter(email_subject_send_graph, !is.na(reference))
#initiate the list for storing the count for each words
email_words_freq <- c()
#loop counting, for each word, the number of email texts in which it is found
for(i in seq_along(word_list)){
word <- as.character(word_list[[i]])
#str_detect flags the email texts that contain the word at least once
#summing the flags gives the number of emails containing the word
nb <- sum(str_detect(df_reference$reference, word))
#store the frequency for each word in the email text
email_words_freq <- c(email_words_freq, nb)
}
#draw the wordcloud with the frequency of each word
wordcloud(word_list, email_words_freq, min.freq = 10 ,max.words=length(word_list), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = "The top words seen in the email text", col.main = "black",font.main = 2)
This word cloud is read like the one for the email subject. It shows us that:
The top words are enron and pleas(e), which are related to the Enron business process and to meetings.
The next most frequent words are related to meetings (attach, inform, receiv). Then we find words linked with the business process, such as contract, chang, and confidential. We often find the words fax and phone, which suggests that the emails refer to phone calls or faxes and lets us think that, at this time, the workers communicated a lot through these channels.
Then we look at the number of email received during the study period about those topics.
email_subject_rec <- df_message_status %>% distinct(date, year, month, recipient, status_recipient, subject, reference) %>%
mutate(#count the number of email which contain at least one word in the list of each topic
subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
#to get the date in year/month
year_month = as.Date(paste0(year,"-",month,"-01")))
#compute the sum of each topics for each month of each year study
email_subject_rec_graph <- email_subject_rec %>%
group_by(year_month) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per year and month
distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#display the different topic trend in the email subject over the study's period
email_subject_rec_graph %>% select(year_month, starts_with("sum_subject_")) %>%
#change the orientation of the data set
pivot_longer(
cols = 2:5,
names_to = "topics",
values_to = "value") %>%
#line plot of the monthly counts
ggplot(aes(year_month,value, color=topics))+
geom_line(linewidth = 1)+
#label, axis, and legend
labs(color = "Email subject topics",
title = "Email received subject analysis over the study period",
x = "Study period",
y = "Number of emails per topic") +
#to display the year and month, every 3 months for a better reading
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(#to get only the customization for the email categories
values = topic_colors[1:4],
labels = topic_label[1:4])
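The `na.rm = TRUE` used when summing the email text indicators matters because a missing email text propagates through `str_detect()` and `if_else()`. Here is a small sketch on toy data showing that behaviour; `topic_demo` is a hypothetical pattern, not one of the topic lists used in this study.

```r
library(stringr)
library(dplyr)

# Hypothetical topic pattern (assumption): words joined with "|" form an alternation
topic_demo <- "meeting|agenda"

# Toy email texts, one of them missing
reference <- c("agenda for the meeting", NA, "gas contract")

# str_detect() returns NA for a missing text, and if_else() keeps that NA
flag <- if_else(str_detect(reference, topic_demo), 1, 0)
print(flag)

# Without na.rm the monthly sum would be NA; with it, missing texts are ignored
total_with_na <- sum(flag)
total_clean <- sum(flag, na.rm = TRUE)
print(c(total_with_na, total_clean))
```

The subject indicators can be summed without `na.rm` only as long as no subject is missing.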
Here, for the subject of the emails received, we can distinguish two spikes for each topic: the first from July 2000 to July 2001 and the second from August 2001 to April 2002. These two spikes fall within the three spikes seen for the emails sent. For the topics, we see the same pattern as for the emails sent.
#display the different topic trend in the email subject over the study's period
email_subject_rec_graph %>% select(year_month, starts_with("sum_email_")) %>%
#change the orientation of the data set
pivot_longer(
cols = 2:5,
names_to = "email",
values_to = "value") %>%
#line plot of the monthly counts
ggplot(aes(year_month,value, color=email))+
geom_line(linewidth = 1)+
#label, axis, and legend
labs(color = "Email text topics",
title = "Email received text analysis over the study period",
x = "Study period",
y = "Number of emails per topic") +
#to display the year and month, every 3 months for a better reading
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(#to get only the customization for the email categories
values = topic_colors[5:8],
labels = topic_label[5:8])
For the emails received about those topics/keywords, we see a pattern similar to that of the emails sent, suggesting that they belong to the same exchanges.
To go deeper into the email content analysis, we next look at the topics and keywords found according to the worker status.
For that, we create a data frame similar to the previous one, but we count the topics/emails per employee status.
status_email_subject_send <- email_subject_send %>%
#we focus on the workers whose status is known
filter(!is.na(status_sender)) %>%
#compute the sum of each topic for each month and each status
group_by(year_month, status_sender) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per year and month
distinct(year_month, status_sender, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the data frame
status_email_subject_send <- status_email_subject_send %>%
pivot_longer(
cols = 3:length(status_email_subject_send),
names_to = "topic_email",
values_to = "value")
#the list of status in the Enron company
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")
#initiate the list to collect the plot
plot_list <- list()
#generating individual plot for each status
for(i in seq(status_list)){
#assign the status to the variable
status <- status_list[i]
#the plot related to that status
p <- status_email_subject_send %>% filter(status_sender == status) %>%
ggplot(aes(year_month, value, color = topic_email))+
geom_line(linewidth = 1)+
labs(color = "Email topics (subject & text)",
title = paste0("Email sent by ", status, ", subject and text analysis"),
y = "Number of emails per topic",
x = "Study period")+
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(values = topic_colors,
labels = topic_label)+
theme(legend.text.position = "bottom")
#append the plot list
plot_list[[i]] <- p
}
#display the plot created
n <- length(plot_list)
#number of plot per layout
plot_per_section <- 3
#loop create plot layouts
for (i in seq(1, n, by=plot_per_section)){
plot_on_the_page <- plot_list[i:min(i+2, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of the layout in 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine together the plots and one legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
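The layout loop above walks through the plot list in steps of `plot_per_section`. As an alternative, here is a minimal base R sketch that splits a list into pages of a fixed size, so the page size appears in one place only. The `plots` object is a toy stand-in (character labels instead of ggplot objects), and the name `pages` is introduced for illustration.

```r
# Toy stand-in (assumption): ten labels play the role of the plots in plot_list
plots <- as.list(paste0("plot_", 1:10))
plot_per_section <- 3

# Assign each plot to a page number: 1, 1, 1, 2, 2, 2, ...
page_id <- ceiling(seq_along(plots) / plot_per_section)
pages <- split(plots, page_id)

# Each element of pages holds the plots of one layout
for (page in pages) {
  cat("layout with", length(page), "plot(s)\n")
}
```

Each element of `pages` could then be handed to the legend extraction and `arrangeGrob()` steps in place of `plot_list[i:min(i+2, n)]`.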
By analyzing the email subject and the email content in function of Enron’s worker status, we can see that:
Every status shows a peak of emails about those topics from April 2001 to January 2002. Also, the top topic for all is the meeting, followed by the business process. Moreover, the tendency we see for the email text is similar for the email’s subject.
The pattern of the emails sent by the employees follows the topics we saw for Enron’s workers previously. After emails about meetings, we see an important number of emails about the business process, with fewer about the core business. This could be linked with the investigation where employees send emails about the process they are involved in.
For the in-house lawyers, we can see two spikes of emails in 2001 regarding meetings and business processes. The first is from February 2001 to July 2001, and the second is from August 2001 to November 2001. These two periods are linked to the investigation by the SEC. We could think that these emails are for managing the investigation.
For the managing director, before June 2001, we can’t really distinguish any top topic in the email content and subject. After that, and until December 2001, we have a peak of emails talking about meetings, business processes, and core business. Here, both business topics seem to be at the same level. We see a similar tendency for the manager. We can think that, during this period, the managers have a lot of meetings to manage both sides of Enron’s businesses.
The traders send a significant number of emails about the core business and processes from July 2001 to March 2002. They speak a little about the Enron event.
Surprisingly, the CEO shows a significant peak of emails related to meetings, core business, and processes from December 2000 to May 2001, and then from November 2001 to January 2002. We can see a slight peak of emails speaking about the Enron event during these two periods, but the count for them is less than other statuses. This suggests they are not really involved in the email exchange during the SEC investigation, or less so than other Enron worker statuses. Perhaps, the email text we have isn’t exhaustive; maybe the emails about those events aren’t public, or most of this communication by the CEO is managed by other means such as phone calls and fax.
For other statuses at the head of the company (President and Vice-president), we can see that we have a peak of emails at the end of 2001 and the start of 2002. The highest peak, after the meeting topic, is linked to the business topics. Additionally, we see more emails that speak about the Enron event compared to the CEO. This suggests that they are more involved in the general management of the company as well as the Enron events than the CEO.
We do the same for the emails received:
status_email_subject_rec <- email_subject_rec %>%
#we focus on the workers whose status is known
filter(!is.na(status_recipient)) %>%
#compute the sum of each topic for each month and each status
group_by(year_month, status_recipient) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per year and month
distinct(year_month, status_recipient, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the data frame
status_email_subject_rec <- status_email_subject_rec %>%
pivot_longer(
cols = 3:length(status_email_subject_rec),
names_to = "topic_email",
values_to = "value")
#initiate the list to collect the plot
plot_list <- list()
#generating individual plot for each status
for(i in seq(status_list)){
#assign the status to the variable
status <- status_list[i]
#the plot related to that status
p <- status_email_subject_rec %>% filter(status_recipient == status) %>%
ggplot(aes(year_month,value, color = topic_email))+
geom_line(linewidth = 1) +
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
labs(color = "Email topics (subject & text)",
title = paste0("Email received by ", status, ", subject and text analysis"),
y = "Number of emails per topic",
x = "Study period")+
scale_color_manual(values = topic_colors,
labels = topic_label)+
theme(legend.text.position = "bottom")
#append the plot list
plot_list[[i]] <- p
}
#display the plot created
n <- length(plot_list)
#number of plot per layout
plot_per_section <- 3
#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
plot_on_the_page <- plot_list[i:min(i+2, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of the layout in 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine together the plots and one legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
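The aggregation by status is written twice, once for the sender and once for the recipient, with only the status column changing. One possible way to share that logic is a small helper that takes the status column as an argument. This is a sketch under assumptions: `summarise_topics` is a hypothetical name, the toy data only mimics the column layout of `email_subject_send`, and `summarise()` is used here instead of `mutate()` followed by `distinct()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: sums every topic indicator per month and per status,
# then returns one row per month, status and topic
summarise_topics <- function(data, status_col) {
  data %>%
    filter(!is.na({{ status_col }})) %>%
    group_by(year_month, {{ status_col }}) %>%
    summarise(
      across(starts_with(c("subject_", "email_")),
             ~ sum(.x, na.rm = TRUE),
             .names = "sum_{.col}"),
      .groups = "drop"
    ) %>%
    pivot_longer(
      cols = starts_with("sum_"),
      names_to = "topic_email",
      values_to = "value"
    )
}

# Toy data (assumption) with the same kind of indicator columns
toy <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01")),
  status_sender = c("CEO", "CEO", NA, "Trader"),
  subject_meeting = c(1, 1, 1, 0),
  email_meeting = c(1, NA, 1, 1)
)

res <- summarise_topics(toy, status_sender)
print(res)
```

Calling the same helper with `status_recipient` on the received emails would give the second table. Note that `starts_with("email_")` would also pick up any other column whose name begins with `email_`, so the selection may need adjusting on the real data.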
When we look at the emails received, we can see that:
The pattern for the emails received looks the same as the one for the emails sent, suggesting most are email exchanges about the same subject.
In the emails received for every status, we can see more emails that speak about the Enron event, suggesting that people in the company are aware of what happened. However, these emails might contain information on what happened or directions to follow in response to potential questions from the investigators.
This email text and subject analysis highlights that the different statuses exchange information about what happens in the company, from the processes used for the business to the management of the investigation as well as the bankruptcy. The head of the company seems to be more informed than active in the email exchange about the Enron event management. It seems that both business parts of the company could be managed more by the president and vice-president than by the CEO. The in-house lawyers are more active in the email exchange during the investigation by the SEC and the bankruptcy, perhaps from a legal point of view.
As we did for all the workers in the company, we look, for each status, at which words from the investigated topics are seen most often in the email subject or text. Here, we focus on the top 10 words found in the subject and text combined.
#Loop allowing to draw the wordcloud with the top 10 words find in email subject/text send by each status
for(i in seq_along(status_list)){
status <- status_list[i]
df <- email_subject_send %>%
#we focus on the worker which their status are know
filter(status_sender == status) %>%
#compute the sum of each topics for each year studied
group_by(year_month, status_sender) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
#keep one line per year and month
distinct(status_sender, subject, reference)
#initiate the liste for storing the count for each words in text and subject
email_words_freq <- c()
subject_freq <- c()
#loop allowing to extract the words in each email text and count the number of type they are found
for(j in seq_along(word_list)){
word <- as.character(word_list[[j]])
#count for the subject
counting_subject <- sum(str_count(df$subject, word))
subject_freq <- c(subject_freq, counting_subject)
#str_detect returns TRUE for each row whose text contains the word
counting_text <- str_detect(df$reference, word)
#we count the rows where the word is found
nb <- sum(counting_text, na.rm = TRUE)
#store the frequency for each words in the email text
email_words_freq <- c(email_words_freq, nb)
}
#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq
#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5) ,col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = paste0("Top 10 words in the email sent by ",status), col.main = "black", font.main = 2)
}
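Because a word cloud only conveys frequencies visually, it can help to print the ranked counts as well. The sketch below shows one way to list the top words from a combined subject and text count. The word list and the counts are invented toy values, not results from the Enron data.

```r
# Toy values (assumptions): four words with invented subject and text counts
words <- c("enron", "please", "attach", "fax")
subject_freq <- c(5, 3, 0, 1)
email_words_freq <- c(10, 8, 4, 0)

# Combine both counts and attach the words as names
total_count <- setNames(subject_freq + email_words_freq, words)

# Sort from most to least frequent and keep the first two (10 in the study)
top_words <- head(sort(total_count, decreasing = TRUE), 2)
print(top_words)
```

Printing `top_words` next to each word cloud makes it easier to compare statuses without reading font sizes.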
This last analysis for the email sent highlights that:
For all statuses, the top words are related to the meeting topics.
The employees and traders also speak about contracts, which we associate with the business process. Maybe this is because they are involved in this step of the Enron business.
The CEOs are the only status with more words related to business than meetings in their email subjects and texts. This suggests they send more emails about business compared to organizing meetings.
#Loop allowing to draw the wordcloud with the top 10 words find in email subject/text received by each status
for(i in seq_along(status_list)){
status <- status_list[i]
df <- email_subject_rec %>%
#we focus on the worker which their status are know
filter(status_recipient == status) %>%
#compute the sum of each topics for each year studied
group_by(year_month, status_recipient) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
#keep one line per year and month
distinct(status_recipient, subject, reference)
#initiate the list for storing the count for each words in text and subject
email_words_freq <- c()
subject_freq <- c()
#loop allowing to extract the words in each email text and count the number of type they are found
for(j in seq_along(word_list)){
word <- as.character(word_list[[j]])
#count for the subject
counting_subject <- sum(str_count(df$subject, word))
subject_freq <- c(subject_freq, counting_subject)
#str_detect returns TRUE for each row whose text contains the word
counting_text <- str_detect(df$reference, word)
#we count the rows where the word is found
nb <- sum(counting_text, na.rm = TRUE)
#store the frequency for each words in the email text
email_words_freq <- c(email_words_freq, nb)
}
#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq
#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = paste0("Top 10 words in the email received by ",status), col.main = "black", font.main = 2)
}
In the emails received, the top 10 words fall into the same topic categories as those in the sent emails. For the CEO, we observe more words about meetings compared to business matters in their top 10. This suggests that the CEO is well-informed about the content of the meetings, such as reports on various topics, but tends to give directions for the core business processes. This is logical given their position.
This analysis highlights that in the emails where the subject and/or text contains the words we searched for, associated with specific topics, the top words are related to meetings. This makes sense when we see the peak of those topics for each status. We could infer that these email exchanges are related to meetings for managing the Enron event as well as the business aspects of the company.
#global environment cleaning
rm(grid_plot, i, j, n, no_legend, p, p3, p4, plot_list, plot_on_the_page, plot_per_section, plots_with_legend, status, status_list,
status_email_subject, adjacencyData_99, adjacencyData_00, adjacencyData_01, adjacencyData_02, word, word_count, nb, legend, email_words_freq, counting, search, df, total_count, email_words_freq, subject_freq)
On the Wikipedia page about the Enron scandal, we find a list of people involved in it. We will search for them in the data set to see whether we can analyze the subject of the emails they sent, as well as whether they played a role in the Enron scandal. Source: Wikipedia page about the timeline of Enron's downfall.
We find:
Kenneth Lay: he was the founder, chief executive officer, and chairman of Enron and was heavily involved in Enron's scandal.
Jeffrey Skilling: he was the CEO of the company during the scandal and deeply involved in the fraud.
Andrew Fastow: he was the chief financial officer and was fired shortly before the bankruptcy.
Lea Fastow: she was an assistant treasurer at Enron and the wife of Andrew Fastow.
Timothy Belden: he was the head of trading at Enron.
Vincent Kaminski: he worked at Enron as the head of the quantitative modelling group.
Jordan Mintz: he was a managing director for corporate tax at Enron.
Sherron Watkins: she was one of the vice presidents at Enron.
Richard Causey: he was the chief accounting officer of Enron.
Greg Whalley: he was an Enron executive.
To this list we add Jeff Dasovich, who isn't found on the Wikipedia page but whom we found to be the most active employee in sending emails. He may have taken part in some exchanges related to the Enron events.
#to find the person involved in the fiscal fraud we use str_detect to see if we can find them in the data set
#for example here for Vincent Kaminski
people_of_interest <- df_message_status%>% filter(str_detect(sender,"kaminski"))
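Once a name has been matched, it is useful to list the distinct addresses it appears under, since the same person can write from several mailboxes. A minimal base R sketch follows. The `sender` vector is a toy stand-in for the `sender` column, using a few addresses that appear in this study.

```r
# Toy stand-in (assumption) for the sender column
sender <- c("vkaminski@enron.com", "vkaminski@aol.com",
            "jeff.dasovich@enron.com", "vkaminski@enron.com")

# Keep the addresses containing the surname (literal match), without duplicates
variants <- sort(unique(grep("kaminski", sender, value = TRUE, fixed = TRUE)))
print(variants)

# Number of emails per address variant
print(table(sender[sender %in% variants]))
```

The list of variants is what feeds the address-to-name mapping built in the next step.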
First, we construct the data sets of the emails sent and received by each Enron worker known for being involved in the fraud.
#email send:
person_of_interest_send <- email_subject_send %>%
filter(str_detect(sender,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
mutate(
#identify the person who sent the email
email_label_sender = case_when(
sender == "jeff.dasovich@enron.com" ~ "Jeff Dasovich",
sender == "kenneth.lay@enron.com" ~ "Kenneth Lay",
sender == "jeff.skilling@enron.com" ~ "Jeffrey Skilling",
sender == "andrew.baker@enron.com" ~ "Andrew Baker",
sender == "tim.belden@enron.com" ~ "Timothy Belden",
sender %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
sender == "andrew.fastow@enron.com" ~ "Andrew Fastow",
sender %in% c("vkaminski@enron.com", "vkaminski@aol.com", "vkaminski@palm.net") ~ "Vincent Kaminski",
sender == "jordan.mintz@enron.com" ~ "Jordan Mintz",
sender == "sherron.watkins@enron.com" ~ "Sherron Watkins",
sender == "richard.causey@enron.com" ~ "Richard Causey", #chief account officer wikipedia source
sender == "greg.whalley@enron.com" ~ "Greg Whalley", #president and COO of Enron wholesale service
.default = sender))
#email received
person_of_interest_reciveid <- email_subject_rec %>%
filter(str_detect(recipient,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
mutate(
#identify the person who received the email
email_label_recipient =
case_when(
recipient %in% c("jeff.dasovich@enron.com","jeff_dasovich@ees.enron.com") ~ "Jeff Dasovich",
recipient == "kenneth.lay@enron.com" ~ "Kenneth Lay",
recipient %in% c("jeff.skilling@enron.com","jeff_skilling@enron.com") ~ "Jeffrey Skilling",
recipient == "andrew.baker@enron.com" ~ "Andrew Baker",
recipient %in% c("tim.belden@enron.com", "tim_belden@pgn.com") ~ "Timothy Belden",
recipient %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
recipient %in% c("andrew.fastow@enron.com", "andrew.fastow@ljminvestments.com") ~ "Andrew Fastow",
recipient %in% c("vkaminski@enron.com", "vkaminski@aol.com","vkaminski@aol .com", "vkaminski@palm.net",
"vkaminski@ol.com", "vkaminski@aol .com", "vkaminski@aol .com") ~ "Vincent Kaminski",
recipient %in% c("jordan.mintz@enron.com","jordan_mintz@enron.com") ~ "Jordan Mintz",
recipient == "sherron.watkins@enron.com" ~ "Sherron Watkins",
recipient == "richard.causey@enron.com" ~ "Richard Causey", #chief account officer wikipedia source
recipient == "greg.whalley@enron.com" ~ "Greg Whalley", #president and COO of Enron wholesale service
.default = recipient))
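The address-to-name mapping above could also be kept in a lookup table, which is easier to extend than a long `case_when()`. The following is a sketch of that idea in base R, not the method used in this study: `label_lookup` is a hypothetical object holding only three of the addresses listed above, and unknown addresses fall back to the address itself, as `.default` does.

```r
# Hypothetical lookup table: names are addresses, values are display labels
label_lookup <- c(
  "jeff.dasovich@enron.com"     = "Jeff Dasovich",
  "jeff_dasovich@ees.enron.com" = "Jeff Dasovich",
  "vkaminski@aol.com"           = "Vincent Kaminski"
)

# Toy recipients (assumption), one of them absent from the table
recipient <- c("jeff_dasovich@ees.enron.com", "someone.else@enron.com",
               "vkaminski@aol.com")

# Indexing by name returns NA for addresses that are not in the table
label <- unname(label_lookup[recipient])

# Fall back to the raw address when no label is known
label <- ifelse(is.na(label), recipient, label)
print(label)
```

With this layout, adding an address variant means adding one entry to the table rather than editing the `case_when()` branches for both senders and recipients.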
We look at the number of emails sent/received by each person studied:
The emails sent each month:
#create a list with the name of each person we want to study
enron_worker_send <- unique(person_of_interest_send$email_label_sender)
#initiate the list to store the plot
worker_send_plot <- list()
#loop allowing to construct a bar plot to display per month the number of email send by each person study
for(i in seq(enron_worker_send)){
worker <- enron_worker_send[i]
p <- person_of_interest_send %>% filter(email_label_sender == worker) %>%
group_by(year,month) %>%
count() %>%
#bar plot
ggplot(aes(month, n, fill = month))+
geom_bar(stat = "identity") +
facet_grid(~year)+
labs(title = paste("Email sent per month for each year by", worker),
y = "Number of emails")+
scale_fill_manual(
values = month_color,
labels = month_label)+
theme(legend.position = "bottom",
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
worker_send_plot[[i]] <- p}
worker_send_plot
The emails received each month:
#list of the persons studied
enron_worker_rec <- unique(person_of_interest_reciveid$email_label_recipient)
#loop allowing to construct a bar plot to display per month the number of email received by each person study
worker_rec_plot <- list()
for(i in seq(enron_worker_rec)){
worker <- enron_worker_rec[i]
p <- person_of_interest_reciveid %>% filter(email_label_recipient == worker) %>%
group_by(year,month) %>%
count() %>%
#bar plot
ggplot(aes(month, n, fill = month))+
geom_bar(stat = "identity") +
facet_grid(~year)+
labs(title = paste("Email received per month for each year by", worker),
y = "Number of emails")+
scale_fill_manual(
values = month_color,
labels = month_label)+
theme(legend.position = "bottom",
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
worker_rec_plot[[i]] <- p}
worker_rec_plot
When we look at the number of emails received/sent by Enron workers known for being involved in the Enron event, we can see they sent fewer emails than they received. Moreover, the pattern of each follows the general pattern of the workers in the Enron company. For all of them, the emails are mainly sent or received in 2001. By adding Jeff Dasovich, whom we identified earlier as the most active sender, we see that he is the most active in this group of people working at Enron. The least active senders in this group are Sherron Watkins, Andrew Baker, Andrew Fastow, and Lea Fastow.
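The sent-versus-received comparison described above is read from two series of bar plots. It can also be checked in a single table. Here is a minimal sketch on invented data: the two toy tables only mimic the label columns of the sent and received data sets, and the person names are placeholders.

```r
library(dplyr)

# Toy data (assumptions): one row per email, labelled with a placeholder name
sent <- tibble(email_label_sender = c("Person A", "Person A", "Person B"))
received <- tibble(
  email_label_recipient = c("Person A", "Person B", "Person B", "Person B", "Person C")
)

# Count emails per person on each side, then join the two counts
comparison <- full_join(
  count(sent, person = email_label_sender, name = "n_sent"),
  count(received, person = email_label_recipient, name = "n_received"),
  by = "person"
) %>%
  # A person absent from one side gets 0 instead of NA
  mutate(across(c(n_sent, n_received), ~ coalesce(.x, 0L)))

print(comparison)
```

Sorting this table by `n_sent` would directly give the most and least active senders of the group.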
Then we look at the number of emails sent about the topics and keywords we have identified.
#extract the worker who are interesting to follow and compute the number of email send by them
person_of_interest_send_subject <- person_of_interest_send %>%
#to compute the number of email sent in each topics by the person whose are directly involved in the Enron scandal
group_by(year_month, email_label_sender) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#for the email we use na.rm = TRUE to allow the sum to be done
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per year and month
distinct(year_month, email_label_sender, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the table
person_of_interest_send_subject <-person_of_interest_send_subject %>%
pivot_longer(
cols = 3:length(person_of_interest_send_subject),
names_to = "topic_email",
values_to = "value"
)
For each Enron worker known for being involved in the different Enron events, we look at the number of emails by creating a line plot that follows the evolution of the topics discussed over the study period.
#initiate the list to collect the plot
plot_list <- list()
#generating individual plot for each status
for(i in seq(enron_worker_send)){
#assign the status to the variable
worker <- enron_worker_send[i]
#the plot related to that status
p <- person_of_interest_send_subject %>% filter(email_label_sender == worker) %>%
ggplot(aes(year_month,value, color = topic_email))+
geom_line(linewidth = 1) +
labs(color = "Email topics (subject & text)",
title = paste("Email sent by", worker, "subject and text analysis"),
y = "Number of emails per topic",
x = "Study period")+
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(values = topic_colors,
labels = topic_label)+
theme(legend.text.position = "bottom")
#append the plot list
plot_list[[i]] <- p
}
#display the plot created
n <- length(plot_list)
#number of plot per layout
plot_per_section <- 3
#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
plot_on_the_page <- plot_list[i:min(i+2, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of the layout in 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine together the plots and one legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
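The loop above builds one plot per person and then assembles the layouts and a shared legend by hand. A possible alternative, shown here only as a sketch on invented data, is to let ggplot2 draw one panel per person with `facet_wrap()`, which shares a single legend automatically. The toy data frame mimics the long format produced by `pivot_longer()`, and the person names and values are placeholders.

```r
library(ggplot2)

# Toy long-format data (assumption): 2 persons x 2 topics x 3 months
toy <- data.frame(
  year_month = rep(as.Date(c("2001-09-01", "2001-10-01", "2001-11-01")), times = 4),
  email_label_sender = rep(c("Person A", "Person B"), each = 6),
  topic_email = rep(rep(c("sum_subject_meeting", "sum_email_meeting"), each = 3), times = 2),
  value = c(1, 3, 2, 0, 1, 1, 2, 2, 5, 1, 0, 2)
)

# One panel per person, one line per topic, a single legend at the bottom
p <- ggplot(toy, aes(year_month, value, color = topic_email)) +
  geom_line() +
  facet_wrap(~ email_label_sender, ncol = 2) +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "1 month") +
  labs(color = "Email topics (subject & text)",
       y = "Number of emails per topic",
       x = "Study period") +
  theme(legend.position = "bottom")

print(p)
```

The trade-off is that all panels share the same y axis by default, which makes people comparable but can flatten the curves of the least active senders. Passing `scales = "free_y"` to `facet_wrap()` relaxes that.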
We can see that:
Jeff Dasovich is the most active Enron worker in this shortlist for sending emails. He sends emails about all topics, especially meetings and various business aspects. He could be one of the employees involved in different events and/or managing them. Perhaps he holds a high level of responsibility within the company.
The other workers, although identified as involved in the events, send far fewer emails about these topics (no more than 15). This may be because the email text data are not exhaustive and many of their emails on these topics were withheld from the public data set.
All of them send emails about meetings, business processes, and Enron events. Surprisingly, we don’t find words associated with Enron’s core business. Perhaps these individuals are more active in the business processes than in the regular affairs of the company.
Next, we look at the number of emails received about those topics.
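The monthly aggregation used for this step can be illustrated on a toy table. This is only a minimal sketch in base R, with invented values and hypothetical worker names (`worker_a`, `worker_b`); it is not the pipeline of this report. It shows how 0/1 topic flags are summed per month and recipient, and why `na.rm = TRUE` matters when the email body is missing.

```r
# Toy data: one row per received email with 0/1 topic flags (values are made up).
received <- data.frame(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01")),
  email_label_recipient = c("worker_a", "worker_a", "worker_b", "worker_a"),
  subject_meeting = c(1, 0, 1, 1),
  email_meeting = c(1, NA, 0, 1),   # NA: no body text available for that email
  stringsAsFactors = FALSE
)

# One row per (month, recipient); na.rm = TRUE is passed on to sum()
# so that emails without body text do not turn the monthly total into NA.
monthly <- aggregate(
  received[c("subject_meeting", "email_meeting")],
  by = received[c("year_month", "email_label_recipient")],
  FUN = sum, na.rm = TRUE
)
print(monthly)
```

In dplyr, `group_by()` followed by `summarise()` returns one row per group directly, which would avoid the `mutate()` plus `distinct()` combination.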
#extract the workers of interest and compute the number of emails they received
person_of_interest_reciveid_subject <- person_of_interest_reciveid %>%
#compute the number of emails received on each topic by the people directly involved in the Enron scandal
group_by(year_month, email_label_recipient) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#the email body can be missing (NA), so na.rm = TRUE is needed to compute the sums
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per month and recipient
distinct(year_month, email_label_recipient, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the table
person_of_interest_reciveid_subject <-person_of_interest_reciveid_subject %>%
pivot_longer(
cols = 3:length(person_of_interest_reciveid_subject),
names_to = "topic_email",
values_to = "value"
)
We then display the emails received about those topics for each Enron worker known to be involved in the Enron events.
#initiate the list to collect the plot
plot_list <- list()
#generate one plot per worker
for(i in seq_along(enron_worker_rec)){
#select the current worker
worker <- enron_worker_rec[i]
#build the plot for that worker
p <- person_of_interest_reciveid_subject %>% filter(email_label_recipient == worker)%>%
ggplot(aes(year_month,value, color = topic_email))+
geom_line(size = 1) +
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
labs(color = "Email topics (subject & text)",
title = paste("Email received by", worker, "subject and text analysis"),
y = "Number of emails per topic",
x= "Study period")+
scale_color_manual(values = topic_colors,
labels = topic_label)+
theme(legend.text.position = "bottom")
#append the plot list
plot_list[[i]] <- p
}
#display the plot created
n <- length(plot_list)
#number of plot per layout
plot_per_section <- 3
#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
#extract the legend from the first plot on the layout
legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
#remove the legend for all plot on the layout
no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
#arrange the plots of this layout (up to 3) on 2 columns
grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
#combine together the 3 plot and one legend
plots_with_legend <- arrangeGrob(
grid_plot,
legend,
nrow = 2,
#arrange the plot and the legend in the layout
heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
)
#display everything together
grid.newpage()
grid.draw(plots_with_legend)
}
We can observe that:
All of them received more emails about the Enron events and about both business topics (core business and business process) than they sent, which suggests they were informed rather than active in the exchanges on those topics. The exception is Jeff Dasovich, who received and sent a similar number of emails related to those topics.
After the meeting topic, Timothy Belden and Vincent Kaminski received more emails about the business process than about any other topic. This may be due to their roles in the company and suggests they are the best informed in this group about the business process.
From this analysis, we can deduce that Jeff Dasovich is highly active in the email exchanges on all the topics investigated here. The other people whose email subjects and content we examined seem to be more passive than active in the exchanges: they send few emails about those topics compared to the number they receive. Among the emails received, a large part concerns the business process as well as meetings. This suggests that these people are aware of how the company manages its business and may take part in meetings about it.
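The contrast between "active" and "informed" workers could be quantified with a simple share of sent emails. The sketch below is only an illustration: the counts are invented, the worker names are hypothetical, and the 0.4 cut-off is an arbitrary assumption rather than a value derived from the data.

```r
# Hypothetical totals of topic-related emails per worker (numbers invented).
activity <- data.frame(
  worker   = c("worker_a", "worker_b", "worker_c"),
  sent     = c(120, 8, 0),
  received = c(130, 95, 40),
  stringsAsFactors = FALSE
)

# Share of the topic-related traffic that the worker sent rather than received.
# Values near 0.5 mean balanced traffic; values near 0 mean mostly receiving.
activity$sent_share <- activity$sent / (activity$sent + activity$received)

# Arbitrary threshold (assumption) used to label the two profiles.
activity$profile <- ifelse(activity$sent_share >= 0.4, "active", "mostly informed")
print(activity)
```

A worker with no topic-related email at all would produce a 0/0 division (`NaN`), so such rows would need to be filtered out first.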
When we first explored the data set, we noted that on average about 1% of the email exchanges involve neither a sender nor a recipient with an Enron email address. These people are potentially external to the company and could be discussing the events. External people who were included in internal email exchanges might also relay to other outsiders what Enron workers were doing in the company. In this part we explore this hypothesis.
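The rule used to call an exchange "external" works at the level of the whole exchange (same date and subject), not of a single message. A minimal base R sketch on toy data, with made-up addresses and subjects, shows the consequence: one message with an Enron address is enough to exclude every message of that exchange.

```r
# Toy messages (addresses and subjects are invented for illustration).
msgs <- data.frame(
  date      = as.Date(c("2001-10-01", "2001-10-01", "2001-10-02")),
  subject   = c("Rate design", "Rate design", "Lunch"),
  sender    = c("a@example.org", "b@example.org", "c@example.org"),
  recipient = c("b@example.org", "d@enron.com", "e@example.org"),
  stringsAsFactors = FALSE
)

# TRUE when either end of the message has an Enron address.
msgs$has_enron <- grepl("@enron", msgs$sender, fixed = TRUE) |
  grepl("@enron", msgs$recipient, fixed = TRUE)

# Number of Enron-related messages within each (date, subject) exchange.
thread_key <- paste(msgs$date, msgs$subject)
msgs$enron_in_thread <- ave(as.integer(msgs$has_enron), thread_key, FUN = sum)

# Keep only the exchanges in which no message involves an Enron address.
external <- msgs[msgs$enron_in_thread == 0, ]
print(external)
```

Here the first "Rate design" message is purely external, but it is dropped because the second message of the same exchange reaches an Enron recipient.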
#extract the email exchanges that do not involve any Enron worker
extern_email <- df_message_status %>% select(date, year, month, sender, recipient, subject, reference) %>%
#flag the senders and recipients that have an Enron email address
mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>%
#for each date and subject, count the senders and recipients with an Enron email address
group_by(date, subject) %>% mutate(
sum_sender = sum(count_sender),
sum_recipient = sum(count_recipient)) %>% ungroup() %>%
#keep the exchanges that involve no Enron email address at all
filter((sum_sender ==0) & (sum_recipient == 0)) %>% select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
#convert sender and recipient to factors
transform(sender = as.factor(sender),
recipient = as.factor(recipient))
summary(extern_email)
## date year month
## Min. :1999-09-19 1999: 870 10 :4347
## 1st Qu.:2000-12-03 2000: 6879 11 :4209
## Median :2001-05-25 2001:15653 12 :3818
## Mean :2001-05-10 2002: 1810 09 :2620
## 3rd Qu.:2001-10-26 05 :1811
## Max. :2002-12-21 04 :1696
## (Other):6711
## sender
## owner-eveningmba@haas.berkeley.edu: 910
## naftcorp@aol.com : 897
## jbennett@gmssr.com : 889
## berk@haas.berkeley.edu : 871
## duggar@haas.berkeley.edu : 761
## feedback@intcx.com : 611
## (Other) :20273
## recipient
## Undisclosed-Recipient : 838
## eveningmba@haas.berkeley.edu: 431
## soblander@carrfut.com : 372
## tie_list_server@nyiso.com : 283
## marketing@nymex.com : 275
## linguaphile@wordsmith.org : 265
## (Other) :22748
## subject
## Quantitative Finance Update from FinMath.com @ Chicago : 897
## NYS Reliability Council Executive Committee : 515
## Brief of Enron Energy Service Inc. on Rate Design -- A. 00-11-038 : 445
## looking for key players to form a founding team of startup : 298
## Comments of Enron Energy Services on Proposed and Alternate Decis\tions -- A. 00-11-038, et al.: 230
## Errata To the Rate Design Testimony of Enron Energy Services Inc. : 214
## (Other) :22613
## reference
## Length:25212
## Class :character
## Mode :character
##
##
##
##
Looking at the data summary, we can see that:
these emails were mostly sent in 2001: the median date is 2001-05-25 and the third quartile is 2001-10-26.
the most frequent sender addresses belong to the Berkeley university domain (haas.berkeley.edu). For the recipients, the most frequent entry is "Undisclosed-Recipient", so we do not know the address of the top recipient.
among the six most frequent subjects, three mention Enron.
This suggests that these exchanges are worth investigating further to see whether they discuss the Enron events. For that we use the same topics and keywords as in the main table.
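As a reminder of how the topic flags work, here is a minimal base R sketch of keyword matching. The keyword list and the name `topic_meeting_demo` are hypothetical (the real `topic_*` patterns are defined earlier in this report), and case-insensitive matching is an assumption made for the example.

```r
# Hypothetical keyword list for one topic; collapsed into a single regex.
meeting_words <- c("meeting", "conference call", "agenda")
topic_meeting_demo <- paste(meeting_words, collapse = "|")

subjects <- c("Agenda for Monday", "Enron rate design brief", NA, "Re: conference call")

# 1 when the text contains at least one keyword of the topic, 0 otherwise.
# Note: grepl() returns FALSE for a missing text, whereas stringr::str_detect()
# returns NA, which is why sums over str_detect() flags need na.rm = TRUE.
flag <- as.integer(grepl(topic_meeting_demo, subjects, ignore.case = TRUE))
print(flag)
```

A short keyword can also match inside longer words, so word boundaries (`\\b`) may be worth adding to the pattern when a keyword list produces unexpected hits.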
extern_email_graph <- extern_email %>% distinct(date, year, month, sender, recipient, subject, reference) %>%
#keep the emails that mention Enron in their subject or in their text
filter(str_detect(subject, "enron|Enron") | str_detect(reference, "enron|Enron")) %>%
mutate(#flag (1/0) the emails containing at least one keyword of each topic
subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
#to get the date in year/month
year_month = as.Date(paste0(year,"-",month,"-01"))) %>%
group_by(year_month) %>%
mutate(
sum_subject_meeting = sum(subject_meeting),
sum_subject_business_process = sum(subject_business_process),
sum_subject_core_business = sum(subject_core_business),
sum_subject_enron_event = sum(subject_enron_event),
#the email body can be missing (NA), so na.rm = TRUE is needed to compute the sums
sum_email_business_process = sum(email_business_process, na.rm = TRUE),
sum_email_core_business = sum(email_core_business, na.rm = TRUE),
sum_email_meeting = sum(email_meeting, na.rm = TRUE),
sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
#keep one line per month and subject
distinct(year_month, subject, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#graph of the emails mentioning Enron that may discuss the Enron events or business process
extern_email_graph %>% select(-subject) %>%
#change the orientation of the data set
pivot_longer(
cols = 2:9,
names_to = "topics",
values_to = "value") %>%
#line plot, one line per topic
ggplot(aes(year_month,value, color=topics))+
geom_line(size = 1)+
#label, axis, and legend
labs(color = "Email topics (subject & text)",
title = "Email subjects and text about the Enron events",
subtitle = "Emails about Enron exchanged between people without an Enron email address",
x = "Study period",
y = "Number of emails per topic") +
#to display the year and month, every 3 months for a better reading
scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
scale_color_manual(#to get only the customization for the email categories
values = topic_colors,
labels = topic_label)
The graph above shows that some emails between people without an Enron email address discuss the Enron events, especially the business process, while fewer discuss the core business of the company. These emails were mostly sent between October 2001 and January 2002, which is the period of the SEC investigation into the Enron fraud. In the email content, we do not find the keywords related to those events.
#isolate the subjects mentioning Enron that match at least one topic
Enron_subject <- extern_email_graph %>%
filter(str_detect(subject, "enron|Enron")) %>%
filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0)) %>% distinct(year_month, subject, .keep_all = TRUE)
#drop the exchanges that appear to be purely external
no_extern <- df_message_status %>% select(date, sender, recipient, subject, reference) %>%
#flag the senders and recipients that have an Enron email address
mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>%
#for each date and subject, count the senders and recipients with an Enron email address
group_by(date, subject) %>% mutate(
sum_sender = sum(count_sender),
sum_recipient = sum(count_recipient)) %>% ungroup() %>%
#keep the exchanges that involve at least one Enron email address
filter((sum_sender !=0) | (sum_recipient != 0)) %>% select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
#convert sender and recipient to factors
transform(sender = as.factor(sender),
recipient = as.factor(recipient))
#inner join with the external subjects to see whether they also appear in exchanges between Enron employees
print(verify <- inner_join(no_extern, Enron_subject, by = "subject"))
## date sender recipient
## 1 2002-01-04 david.forster@enron.com louise.kitchen@enron.com
## 2 2001-12-07 louise@enron.com louise@enron.com
## subject
## 1 EnronOnline Documents
## 2 NYTimes.com Article: Enron Paid Out Retention Bonuses Before Bankruptcy Filing
## reference year_month sum_subject_meeting sum_subject_business_process
## 1 <NA> 2001-12-01 0 0
## 2 <NA> 2001-12-01 0 0
## sum_subject_core_business sum_subject_enron_event sum_email_business_process
## 1 1 1 1
## 2 1 1 1
## sum_email_core_business sum_email_meeting sum_email_enron_event
## 1 0 1 1
## 2 0 1 1
We can see that two subjects appear both in the external exchanges and in the data set restricted to exchanges involving at least one Enron email address. These emails were sent in December 2001 and January 2002: one was sent by David Forster (david.forster@enron.com) and concerns EnronOnline documents, the other was sent from louise@enron.com and relates to a New York Times article about retention bonuses paid before the bankruptcy filing. This suggests that these emails also involved people external to Enron, who may have spread this information outside the company.
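One caveat of this check is that the join key is the subject alone: in the printed result, the message dated 2002-01-04 is paired with an external row whose `year_month` is 2001-12-01. The sketch below, in base R with invented rows, shows how adding the month to the key makes the match stricter. Whether a stricter key is preferable here is a judgment call, not something the data settle.

```r
# Toy internal and external tables (rows invented for illustration).
internal <- data.frame(
  subject    = c("EnronOnline Documents", "EnronOnline Documents", "Team lunch"),
  year_month = as.Date(c("2002-01-01", "2001-12-01", "2001-12-01")),
  sender     = c("x@enron.com", "y@enron.com", "z@enron.com"),
  stringsAsFactors = FALSE
)
external <- data.frame(
  subject    = "EnronOnline Documents",
  year_month = as.Date("2001-12-01"),
  stringsAsFactors = FALSE
)

# Join on the subject only: every internal message with that subject matches,
# whatever its month (the clashing columns get the .x / .y suffixes).
by_subject <- merge(internal, external, by = "subject")

# Join on subject and month: only messages from the same month match.
by_subject_month <- merge(internal, external, by = c("subject", "year_month"))

print(nrow(by_subject))
print(by_subject_month$sender)
```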
To conclude this project, we can say that Enron’s workers hold different statuses, which seem to have varying degrees of involvement in the fiscal fraud. The people at the head of the company, as well as the traders and the lawyers, seem to be active participants in the fraud. The other statuses seem to be aware of it, perhaps without playing a significant role in it. Looking at the people known to be involved in the Enron fiscal fraud, we do not identify many emails sent or received about it, nor about the management of the bankruptcy or the SEC investigation. We can assume they used other means of communication. Given the period, they might have communicated more by phone than by email. A brief investigation of potential external exchanges shows that people at other organizations in the US discussed the Enron events, and two email subjects are directly associated with internal company exchanges. It would be interesting to investigate the email content further with a more exhaustive data set. This would improve our understanding of the Enron events as well as the involvement of the different statuses in them.